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Abstract 

This paper uncovers and explores the close relationship between Monte Carlo Optimization 
of a parametrized integral (MCO), Parametric machine-Learning (PL), and 'blackbox' or 
'oracle'-based optimization (BO). We make four contributions. First, we prove that MCO is 
mathematically identical to a broad class of PL problems. This identity potentially provides 
a new application domain for all broadly applicable PL techniques: MCO. Second, we 
introduce immediate sampling, a new version of the Probability Collectives (PC) algorithm 
for blackbox optimization. Immediate sampling transforms the original BO problem into 
an MCO problem. Accordingly, by combining these first two contributions, we can apply 
all PL techniques to BO. In our third contribution we validate this way of improving BO 
by demonstrating that cross-validation and bagging improve immediate sampling. Finally, 
conventional MC and MCO procedures ignore the relationship between the sample point 
locations and the associated values of the integrand; only the values of the integrand at 
those locations are considered. We demonstrate that one can exploit the sample location 
information using PL techniques, for example by forming a fit of the sample locations to the 
associated values of the integrand. This provides an additional way to apply PL techniques 
to improve MCO. 



1. Introduction 



This paper uncovers and explores some aspects of the close relationship between Monte 
Carlo Optimization of a parametrized integral (MCO), Parametric machine Learning (PL), 
and 'blackbox' or 'oracle-based' optimization (BO). We make four primary contributions. 
First, we establish a mathematical identity equating MCO with PL. This identity poten- 
tially provides a new application domain for all broadly-applicable PL techniques, viz., 
MCO. 

Our second contribution is the introduction of immediate sampling. This is a new 
version of the Probability Collectives (PC) approach to blackbox optimization. PC encom- 
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as special cases. However PC is broader and more fully motivated. This 
means it uncovers (and overcomes) formal shortcomings in those other approaches. 

In the immediate sampling version of PC the original BO problem is transformed into an 
MCO problem. In light of our first contribution, this means we can apply PL to immediate 
sampling. In this way all PL techniques — including cross-validation, bagging, boosting, 
active learning, stacking, and others — can be applied to blackbox optimization. 
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In our third contribution we experimentally explore the power of this identity between 
MCO and PL. In these experiments we demonstrate that cross-validation and bagging 
improve the performance of immediate sampling blackbox optimization. In particular, in 
these experiments we show that cross-validation can be used to adaptively set an 'annealing 
schedule' for blackbox optimization using immediate sampling without any extra calls to 
the oracle. In some cases, we show that this adaptively formed annealing schedule results 
in better optimization performance than any exponential annealing schedule]^ 

Finally, conventional MC and MCO procedures ignore the relationship between the 
sample point locations and the associated values of the integrand. (Only the values of 
the integrand at the sample locations are considered by such algorithms.) We end by 
exploring ways to use PL techniques to exploit the information in the sample locations, for 
instance, by Bayesian fitting of a surface from the sample locations to the associated values 
of the integrand. This constitutes yet another way of applying PL to MCO in general, and 
therefore to BO in particular. 

1.1 Background on PL, MCO, Blackbox Optimization, and PC 

We begin by sketching the four disciplines discussed in this paper: 



1. 



A large number of parametric machine-learning problems share the following two properties. 
First, the goal in these problems is to find a set of parameters, 9, that minimizes an integral of 
a function that is parametrized by 9. Second, to help us find that 9 we are are given samples 
of the integrand. These problems reduce to a sample-based search for the 9 that we predict 
minimizes the integral. We will refer to problems of this class as Parametric Learning (PL) 
problems. 

An example of PL is parametric supervised learning, where we want to find an optimal 
predictor or regressor zg that minimizes the associated expected loss, Jdxdy P(x)P{y \ 
x)L[y, Z0{x)], where x's are inputs and y's are outputs. We do not, however, know P{x)P{y \ 
x). Instead, we are provided a training set of samples of P{x)P{y \ x). The associated PL 
problem is to use those samples to estimate the optimal 9. 



MCO is a technique for solving problems of the form argmin^ J dw U(w, (j)) (see Ermoliev and 
Norkin 19981. MCO starts by replacing that integral with an importance-sample generated 
estimate of it. That estimate is a sum parametrized by (p. In MCO one searches for the value 
(j) that minimizes this sum; the result of this search is one's estimate of the (f) that optimizes 
the original integral. 

3. Blackbox optimization algorithms are ways to minimize functions of the form G : X —tM. when 
one does not actually know the function G. Such algorithms work by an iterative process in 
which they first select a query x G X, and then an 'oracle' returns to the algorithm a (poten- 
tially noise-corrupted) value G{x), and no other information, in particular, no gradient infor- 
mation. The difference between one blackbox optimization algorithm and another is how they 
select each successive query based on the earlier responses of the oracle. Examples of blackbox 
optimization algorithms are genetic algorithms (Mit chell] [1996 1, simulated anne aling (Kirk 



Patrick et al 



Montgomery 



1983), hill-climbing algorithms, Response-Surface Methods (RSMs) ( [Myers and 



2002), and some forms of Sequential Quadratic Programming (SQP) (Gill et al 



(CE) method (Rubinstein and Kroese 20041, and others 



1981[ [Nocedal and Wright[ |1999[), Estimation of Distribution Algorithms (EDAs)( |De Bonet 
et al. [1997 Larraaga and Lozano 2001 Lozano et al. , 2005 1, tabu search, the Cross Entropy 



1. Since they are special cases of PC, presumably we could similarly apply PL techniques to improve EDA's 
or the CE method. 



2 



4. PC is a set of techniques that can be used for blackbox optimization. Broadly speaking, 
PC works by transforming a search for the best value of a variable x into a search for the 



best probability distribution over the variable, q{x) (see 
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1. Once one solves for the optimal q{x), inversion to get the optimal x for 



the original search problem is stochastic; one simply samples q. As described below, PC has 
many practical strengths, and is related to RSMs, ED As, and the CE method. 



1.2 Roadmap of This Paper 

We make four primary contributions: 

1. Sec. [2] begins with a detailed review of MCO and PL. Conventional analysis of Monte Carlo 
estimation involves a bias-variance decomposition of the error of the estimator of a particular 
integral. We show that for MCO, a full analysis requires more than simply extending such 
bias- variance analysis separately to each of the estimators given by the separate (/)'s. Moments 
coupling the errors of the estimators for the separate 0's must also be taken into account. How 
should we do that? 

To answer this, we note that in a different context, the techniques of PL take such coupling 
moments into account, albeit implicitly. This leads us to explore the relation between MCO 
and PL. This in turn leads to our first major contribution, the proof that MCO is identical 
to PL. This contribution means that one can apply all PL techniques, for instance, cross- 
validation, bagging, boosting, stacking, active learning and others, to MCO. Such PL-based 
MCO (PLMCO) provides a new way of minimizing potentially high-dimensional parametrized 
integrals. 

Experimentally testing the utility of applying PL to MCO requires an MCO application do- 
main. Here we choose the domain of blackbox optimization. To establish how blackbox 
optimization is an application domain for MCO requires our second contribution, as follows. 

2. We start in Sec.[3]by presenting an overview of previous versions of the blackbox optimization 
approach of PC. We then make our second contribution in the following section, where we 
introduce immediate sampling, a new version of PC that overcomes some of the limitations of 
previous versions. 

These first two contributions are combined by the fact that immediate sampling is a special 
case of MCO. The resultant identity between PL and immediate sampling means that, in 
principle, any PL technique can be applied to blackbox optimization. In particular, regular- 
ization, cross-validation, bagging, active learning, boosting, stacking, kernel machines, and 
others, can be 'cut and paste' to do blackbox optimization. This use of PL for blackbox 
optimization constitutes a new application domain for PL. 



In Sec. 4.5 we present some concrete instances of how to modify immediate sampling to use 
PLMCO rather than conventional MCO. It is important to note that when applied (via im- 
mediate sampling) to blackbox optimization, these PL techniques do not require additional 
calls to the oracle. For example, using cross-validation to set regularization parameters in 
immediate sampling (the equivalent of an annealing schedule in SA) does not involve running 
the entire blackbox optimization algorithm with different regularization schedules. As an- 
other example, using bagging in immediate sampling does not mean running the optimization 
algorithm multiple times based on different subsets of the sample points found so far. 

3. Our third contribution is to experimentally demonstrate in Sec. [5] that PLMCO substantially 
outperforms conventional MCO when used this way for blackbox optimization. We are partic- 
ularly interested in blackbox optimization problems where calls to the oracle are the primary 
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expense. Accordingly, non-oracle, 'offline' computation is considered free. So in our experi- 
ments we compare algorithms based on the values of G found by the algorithms versus the 
associated number of calls to the oracle]^ In particular, we show that bagging and cross- 
validation leads to faster blackbox optimization on two well-known benchmark problems for 
continuous nonconvex optimization. 

It should be emphasized that these experiments are not intended to investigate whether 
PLMCO applied to immediate sampling is superior to other blackbox optimization algorithms. 
Rather their purpose is to investigate whether one can indeed leverage the formal connection 
between PL and MCO to improve immediate sampling. Accordingly, these experiments are 
on toy domains, and we do not compare performance with other blackbox optimization algo- 
rithms. We leave such comparisons to future papers. 

In estimating the value of an integral based on random samples of its integrand, conventional 
MC and MCO techniques ignore how the locations of the sample points are related to the 
associated values of the integrand. Such techniques concentrate exclusively on those sample 
values of the integrand that are returned by the oracle. However, one can use the sample 
locations and associated integrand values to form a supervised learning fit to the integrand. 
In principle, such a fit can then be used to improve the overall estimate of the integral. 

In 'fit-based' MC and MCO one uses all the data at hand to fit the integrand and then uses 
that fit to improve the algorithm. In this paper, we concentrate on situations where the data at 
hand consist only of sample locations and the associated values of the integrand, but in other 
situations the data at hand may also include information like the gradient of the integrand at 
the sample points. In their most general form, fit-based MC and MCO include techniques to 
exploit such information. 

One natural Bayesian approach to fit-based MC uses Gaussian processes. Work adopting this 
approach, for the case where the data only contain sample locations and associated integrand 
values, is reviewed in |Rasmussen and Gharamani (20031. In Sec.|6]we generalize that work 



on fit-based MC, e.g., to allow non-Bayesian approaches. In that section, we also consider 
fit-based MCO in general, and fit-based immediate sampling in particular. 

One of the ways cross-validation is used in these experiments is to set a regularization 
parameter. In immediate sampling, the regularization parameter plays the same role as 
the temperature does in simulated annealing. So intuitively speaking, our results show 
how to use cross-validation to set an annealing schedule adaptively for blackbox optimiza- 
tion, without extra calls to the oracle. We show in particular that such auto-annealing 
outperforms the best-fit exponential annealing schedule. 

There are more topics involving the connection between MCO, immediate sampling 
and PL, than can be explored in this single paper. One such topic is how to incorporate 
constraints on x in immediate sampling. Another important topic involves a derivation 
from first principles of the objective function used in immediate sampling. These two 
topics are briefly discussed in the appendices. Some other topics are mentioned, albeit 
even more briefly, in the conclusion. 



1.3 Notation 



As a point of notation, we will use the term 'distribution' to refer either to a probability 
distribution or a density function, with the associated Borel field implicitly fixing the 
meaning. Similarly, we will write integrals even when we mean sums; the measure of the 



2. See 



Wolpert and Macready (19971; 



Droste et al. 



(20021; Wolpert and Macready (20051; 



Knowles ( 2003 1; Igel and Toussaint ( 2004 1; Schumacher et al. ( 2001 1 for a discussion of the mathematics 



relating algorithms under such performance measures. 
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integral is implicitly taken to be the one appropriate for the the argument. We will use Q 
to indicate the Heaviside or indicator function, which is 1 if its argument is positive, and 
otherwise. 

We will use ^ to mean the set of all distributions over X. We are primarily interested 
in X's that are too large to permit computations involving all members of Accordingly, 
we will will work with parametrized subsets ^ C 3^. We generically write that (possibly 
vector- valued) parameter as 9, and write the element of ^ specified by 9 as qg. We use E 
to indicate the expectation of a random variable. Subscripts on E are sometimes used to 
indicate the distribution(s) defining the expectation. 

We take any oracle to be an cc-indexed set of independent stochastic processes, and 
use the symbol g to indicate the generic output of the oracle in response to any query. With 
some abuse of notation, we denote the output of the oracle for query x as P{g \ x,'^). For 
a noise-free, or single-valued oracle, we write P{g \ x^'^) = S(g — G{x)) for some function 
G implicitly specified by where S{.) is the Dirac delta function. 

When there is both a factual version of a random variable and a posterior distribution 
over counter-factual values of that variable, they must be distinguished. In general this 



requires extending the conventional Bayesian formalism (see Wolpert 1997 19961. Here, 
though, it suffices for us to indicate counter-factual values by a subscript c. Say there 
is a factual oracle and we are provided a data set D formed by sampling 5^. We use 
superscripts to denote different samples in that data set. Then D in turn induces a posterior 
over oracles, and we write that posterior as P(^^c | D). 

2. MCO and PL 

In this section we review PL and MCO show that they are mathematically identical. 

2.1 Overview of PL 

A broad class of parametric machine learning problems try to find 



(PI): argmin^ J dx P{x)Rx{0- 



For subsequent purposes, it will be useful to write a; as a subscript and ^ as an argument 
of R, even though x is the integration variable and ^ is the parameter being optimized. 
To perform this minimization, we have a set of function values D = {Rx^iS.)}i where we 
typically assume that the samples x'' ,i ^ 1, . . . , N were formed by IID sampling of P{x). 
The maximum likelihood approach to this minimization first makes the approximation 



(dxP{x)Rx{0 « ^E^-'(^)' 



One then solves for the ^ minimizing the sum, and uses this as an approximation to the 
solution to PI. In practice, though, this procedure is seldom used directly: although 
the approximation in Eq. fllis unbiased for any fixed ^, min^ R^iO is not an unbiased 
estimate of min^ J dx P{x)Rx{£,). Therefore, when this approximation is exploited, it is 
modified to incorporate bias-reduction techniques. 
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Example: Parametric Supervised Learning: 

Let X,Y be input and output spaces, respectively. Let L{y^,y^) : Y x Y ^ R he a, loss 
function, and : X — > K be a ^-parametrized set of functions. In parametric machine 
learning with IID error our goal is to solve 



argmin^ J dx P{x) J dy P{y \ x)L{y, z^{x)), 

argmin^ J dx P{x)Rr^{S^). 

Intuitively, Rx{£,) is the expected loss at x for the 'fit' z^{x) to the a;-indexed set of distri- 
butions P{y I x). 

To perform this minimization we have a training set of pairs D = {x'^,y''},i = 
1^ . . . , N, that we assume were formed by IID sampling of P{x)P{y \ x). The maximum 
likelihood approach to this minimization first makes the approximation 

j dx P{x) I dy P{y I x)L{y, z^{x)) « ^ E ^^V^' ^«(^'))' 

i 

One then solves for the ^ minimizing the sum, and uses this as an approximation to the 
solution to (PI). As discussed above, in practice, this minimization is rarely used directly, 
and is usually combined with a bias-reducing technique like cross-validation. 

2.2 Overview of MCO 

Consider the problem 

(P2) : ai-gmin^^^ J dw U{w,4>). 
For now, we do not impose constraints on 0, nor restrict <I>. Monte Carlo Optimiza- 



tion (Ermoliev and Norkin 19981 is a way to search for the solution of (P2). In MCO we 



use importance sampling to rewrite the integral in (P2) as 

dw U{w, (j)) = [ dw v{w) ^ ' '^^ 



v{w) 
dw v{w)ry^u^w{(j)), 

for some sampling distribution v. Following the usual importance sampling procedure, we 
IID sample v to form a sample set {U{w^,.) : i = 1,...A^}, which specifies a set of N 
sample functions 

It is implicitly assumed that for any w, we can evaluate v{w) up to an overall normalization 
constant. 
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In MCO, these N functions are used in combination with any prior information to 
estimate the solution to (P2). Conventionally, this is done by approximating the solution 
to (P2) with the solution to the problem 

(P3) : argmin^^r* ((/)). 

We define 

(0) 

0i),;7,{u.'}(0) 

For notational simplicity, the subscripts will usually be omitted in these expressions. We 
will use the term naive MCO to refer to solving (P3) by minimizing ^{4>). 



= J dw U {w, (p), 

i 

= argmin[^„^[7r^ii((/))]. 



2.3 Statistical Analysis of MCO 



The statistical analysis of MC estimation of integrals is a relatively mature field (see Robert 

[l996j ). 



and Casella 2004 Fishman 



We now show that when such MC estimation is com- 



bined with parameter optimization in MCO, the analysis becomes much more involved. 



2.3.1 Review: MC Estimation 

First consider MC estimation, with no mention of MCO. We first need to to specify a loss 
function L(., .) that will couple our mathematics with real-world costs. The first argument 
of such an L is the output of the estimation algorithm under consideration. The second 
argument is the quantity statistically sampled by that algorithm. The associated value of 
L is the cost if the algorithm produces the output specified in that first argument, using 
the quantity specified in the second argument. 

As an example, consider importance-sampled MC estimation of an integral. Using the 
MCO notation just introduced, we use ^(0) as an estimate of ^{(j)) for some fixed 0. The 
quantity being sampled is the function J7(.,0), and the output of the algorithm is ^(0). 
Accordingly, these are the arguments of the loss function. 

The most popular loss function in statistical analysis of MC integral estimation is 
quadratic loss, given below. 

L(i^(0),f/(.,0)) ^ [J^{(t>)- j dwU{w,<t>)f. 

Unless explicitly stated otherwise, we will henceforth use the term 'expected loss' to refer 
to the average of this loss function over sample sets. Since ^{(f) is an unbiased estimate 
of .if (0), the expected loss is the sample variance, 

Var(i^(0)) = E([f:^^f^]^) - [E(E^S^)]' 
V x-rn Nv(w^) ' ' ^ Nv(w^) 

= -^r{ldwv{w)[ — — — ] - [ dwv{w) — ^^^] } 
iV J v[w) J v[w) 
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This expansion for the sample variance is quite useful. For example, one can solve for the 
V that minimizes this variance (and therefore minimizes expected loss) as a function of 
U{.,(f>). For nowhere-negative U, that optimal v is given by (see |Robert and Casella 20041 



v{w) 



J dw'U{w',(t))' 



Given the formula for the optimal v, one can estimate it from a current sample set, and 
then use the estimated optimal v for future sampling. This is what is done in the VEGAS 
Algorithm (Lepage 1978 19801. Consideration of the sample variance has also led to 



algorithms that partition X and then run importance sampling on each partition element 
separately, for instance, stratified sampling (Fishman 19961. MC estimators that do not 



use strict importance sampling may introduce bias. However, if the variance is sufficiently 
reduced, expected quadratic error is reduced. This can be exploited to tradeoff bias and 



variance. 



2.3.2 From MC to MCO 

When we combine MC with parameter optimization in MCO, quantities like Var(^((/))) 
for one particular </> are not the main objects of interest. Instead, we are interested in 
expected loss of our iterated MCO algorithm, which involves multiple (p's. So what is the 
appropriate loss function for analyzing MCO? From the very definition of (P2), it is clear 
that we want L{(j), U) to be minimized by the that minimizes J dw U (w, cj)). The simplest 
approach to doing this, which will be assumed from now on, stipulates that 

L{(I),U) = = [dwU{w,(j)), (4) 



the same integral appearing in (P2). If we can solve (P2) exactly, then we will have 
produced the (f> with minimal value of this loss function. 

Given this choice of loss function, expected loss in naive MCO is 

N 

E{L \U,v) = dw'... dw^ Y[v{w')^i&rgmm^[.^,^u.{^.y{cj))]) 

i=i 

= [dw'...dw^ ]Jv{w^) [dw'U{w',^vgmm^[J2^^^f^ (5) 

The optimal v for naive MCO is the one that minimizes E(L | C/, f). There is no direct 
relation between this v an d the one that minimizes loss for some single </). In stark contrast 



to the MC analysis in Sec. 2.3.1 in addition to the sample variance Var(^((^)) for any single 
(f), the expected loss E(L | U, v) now also depends on moments coupling the di stribut ions 
of ^{(p) for different ip's. Loosely speaking, the bias-variance tradeoff in Sec. 



2.3.1 



now 

becomes a more complicated bias-variance-covariance tradeoff. Now, setting w is more 
involved, but we can approach it as follows. 

Expressing the expected loss slightly differently gives us an important insight. Note 
that each sample set {w^} gives rise to an associated set of estimates for all G <&. Call this 
(possibly infinite dimensional) vector of estimates I, each of whose components is indexed 
by (f> and is an estimate for that particular (j). Now, instead of computing expected loss by 
averaging over all possible sample sets, we average over all possible vectors I. In order to 
do this, we need to specify the probability of each vector /. Define 
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So, TT^ [7 4.(0 is the probability of a set of sample points {w^} such that for each the 
associated empirical estimate ''f,;7,M''('^) equals the corresponding component of I. For 
notational simplicity the subscripts of tt^ „ ^ will sometimes be omitted. We can now write 
Eq. [5] succinctly as 

E{L\U,v) = J dlTT{l) ^{a.rgmm^[l^]) 

where ^aigmin^ll^y means the index (j) of the smallest component of /. The risk is the 
difference between this expected loss and the lowest possible loss. We can write that risk 
as 

J dl TTy^uM^) [■^(ai'gmin^[/^]) - mm^[^{(j))]]. 

Our sample set constitutes a set of samples of 7r„^(7^$ occurring in Eq. l6]This fact can 
potentially be exploited to dynamically modify v and/or $ to reduce E{L\U,v). Indeed, 
for the simpler case of MC estimation, this is essentially the kind of computation done in 
the VEGAS algorithm mentioned above. As a practical issue, it may be difficult to update 
V and/or $ using the full formula Eq. [s] Instead, one could approximate that formula 
E(L I U,v) near a single (p of interest, e.g., about a current estimate for the optimal </). 

Intuitively though, one would expect that for a fixed set of (j)'s, everything else being 
equal, it would be advantageous to have small variances of unbiased estimators and large 
covariances between them. Such considerations based on the second moments may help 
one choose quantities like the sampling distribution v. 

Such considerations may also help one choose the set of candidate 0's, <i>. For example, 
one way to have large covariances between the € <& is to have the associated functions 
over w, {U{., (p) : (j) ^ <&}, all lie close to one another in an appropriate function space (e.g., 
according to an Zoo norm comparing such functions). However, choosing such a $ will tend 
to mean there is a small 'coverage' of that set of functions, {U (., 4>) : cj) E More precisely, 
it will tend to prevent the best of those (/)'s from being very good; min0g$[J dw U{w, cf))] 
will not be very low. 

This illustrates that, in choosing (f>, there will be a tradeoff between two quantities: 
The first quantity is the best possible performance with any of the The second 

quantity is the risk, that is, how close a given MCO algorithm operating on <i> is likely 
to come to that best possible performance of a member of Choosing $ to have large 
covariances of the (MC estimators based on the) members of and in particular to have 
large covariances with the truly optimal (j), argmin0g$^(0), will tend to result in low risk. 
But it will also tend to result in poor best-possible performance over all g <I>. 

Similarly, one would expect that as the size of $ increases, there would be a greater 
chance that a sample set for one of the suboptimal e <f> would have low expected loss 
'by luck'. This would then mislead one into choosing that suboptimal 0. So increasing the 
size of <i> may increase risk. However increasing <i>'s size should also improve best possible 
performance. So again, we get a tradeoff. 

It may be that such considerations involving the size of <i> and the covariances of its 
members can be encapsulated in a single number, giving an 'effective size' of (somewhat 
analogously to the VC dimension of a set of functions). Such tradeoffs are specific to the 
use of MCO, and do not arise in plain (single-0) MC. They are in addition to the usual 
bias-variance tradeoffs, which still apply to each of the separate MC estimators. 

An illustrative example of the foregoing is provided in App. [C] A more complete sta- 
tistical analysis of risk in MCO, including Bayesian considerations, is in Sec. |6] 



MCO 


PL 


w 


X 






v{w) 


P{x) 


rv,w{4>) 




ri[4>) 


mo 



Table 1: Correspondence between PL and MCO. 



2.4 PL Equals MCO 

In MCO, we have to extrapolate from the sample set of w values to perform the integral 
minimization in Eq. [3j As discussed above, this can recast as having a set of sample 
functions </> r'^{(j}) that we want to use to estimate the (j) that achieves that minimization. 
Similarly, in PL, we have to extrapolate from a training set of functions to minimize 

the integral / dx P{x)Rx{0- Though not usually viewed this way, at the root of this 
extrapolation problem is the problem of using the sample functions ^ ^ to estimate 

the minimizer of Eq. [2] 

In addition, the analysis of Sec. |2.3| is closely related to the PL field of uniform conver- 
gence theory. That field can be cast in the terms of the current discussion as considering 
a broad class of U's, 'W . Its starting point is the establishment of bounds on how 



max„_(7g^[y 7r^^(7^$(/) 6(^(argmin^[/0]) - min^lJif {(p)] - k) (8) 

depend^l^on k. Of particular interest is how the function taking k to the associated bound 
varies with characteristics of and <I> (see Vapnik, 1982 1995). Eq.js] should be compared 



with Eq. [7] 

All of this suggests that the general MCO problem of extrapolation from a sample set 
of empirical functions to minimize the integral of Eq. [3] is, in fact, identical to the general 
PL problem of extrapolation from a training set of empirical functions to minimize the 
integral of Eq. [2] This is indeed the case. As shown in Table [l] identify ^ <-> 0, a; ^ 
w,P{x) ^ v{w),Rx{S,) ^ r„,„((^),rj,(^) ^ R'{^). Then the integrals in Eq. [s] and (PI) 
become identical. So the MCO expected loss function in Eq. |4] becomes identical to the PL 
expected loss. Similarly, the sample functions for MCO and PL become identical. 

In particular, in supervised learning, when there is no noise, P{y \ x) becomes a single- 
valued function y{x), and the parametric supervised learning problem becomes 



argmin^ J dx P{x)[L{y{x), z^{x))] 

This should be compared to the MCO problem as formulated in Eq. [3] For the same reasons 
that direct minimization of Eq. [l]is seldom used in PL, we now see that naive MCO will 
be biased, and should preferably not be used directly. 

Note that most sampling theory analysis of PL does not directly consider the biases and 
variances of the separate Monte Carlo estimators for each ^, nor does it directly consider the 
moments that couple the distributions of those estimates. Rather, it considers a different 



3. As an example, rewrite w x,(f) ^ a,v{x) P{x). Also choose to be all functions of the form 
U{w,(f)) = U{x,a) = J dy P{y j x){y — F{x,a)Y for any function F and distribution P{y \ x). Under 
this substitution, Eq. [8] becomes the archetypal uniform convergence theory problem for regression with 
quadratic loss. 
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type of bias and variance — the bias and variance of an entire algorithm that chooses a ^ 
based on associated MC estimates of expected loss (Wolpert 19971. In this sense, such PL 



analysis bypasses the issues considered in Sec. |2.3[ The bias-variance-covariance approach 
described in this section might have important implications on PL analysis of learning 
algorithms, but for the moment, in our exploration of the identity between MCO and PL, 
we simply use PL-based techniques to reduce the bias or variance of our algorithms. 



3. Review of PC 

This section cursorily reviews the previously investigated type of PC. It then briefly dis- 
cusses the advantages of PC for blackbox optimization and its relation to other blackbox 
optimization techniques. 

3.1 Introduction to PC 

To introduce PC, consider the general (not necessarily blackbox) optimization problem 

(P4) : argmin^gxE(g I x,^). 

For now, we ignore constraints on x. In PC we transform (P4) into the problem 

(P5) : argmin^ggo,^^(g9), 

for some appropriate function ^cg. After solving (P5) we stochastically invert qe to get an 
X (the ultimate object of interest), by sampling qg. This type of "randomizing transform" 
contrasts with conventional transform techniques, where inversion is deterministic. 

Ideally, should be chosen in a first-principles manner, based on exactly how qg will 
be sampled and how those samples used (see Sec. |6]). In practice though, computational 
considerations might lead one to choose heuristically. Intuitively, such considerations 
might compel us to choose J?c# both so that (P5) is readily easy to solve, and so that any 
solution qg to (P5) is concentrated about the solutions of (P4). Taking the parametrization 
to be implicit, we often abbreviate ^--^{qg) as just ^^(0). 

In many variants of PC explored to date, ^cg{9) is an integral transforrrj^ over X, 



^^{9) ^ I dxdg P{g\x,^)F{g,qg{x)). 



As an example of such an integral transform, consider optimization with a noise-free (single- 
valued) oracle, P{g \ x,'^^) = S{g — G{x)), where the transformed objective is the expected 
value of {g\x) under x ~ qg. In other words, = Egj,[G(x)]. In addition, suppose that 
£S — that is, qg can be any distribution. Under fairly weak assumptions, it can be shown 
that one solution to (P5) is given by the point-wise limit of Boltzmann distributions, 

p*{x) — lim p^{x), where p^{x) oc exp[— /3G(a;)]. 

In the case where ^ C we could choose ^^^{9) to be a measure of the dissimilarity 
between such a Boltzmann (or other) 'target' distribution, and a given qg. For instance. 



4. An instance where this is not the case is with the elite objective function, described in App. B. 
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we could use a KuUback-Leibler (KL) divergence between qg and , which we refer to as 
"p(7" KL distance: 

^^{6) = KL(/||ge) 
In terms of the quantities in Eq. [9] F{g,qg{x)) cx e~^^ln[qg{x)], up to an overall additive 



constant. So rp(g|^ 6') in Eq. 10 is the contribution to the KL distance between 
and qe given by the argument x. 

To see why this choice of ■^<^{9) is reasonable, first note that pf^{x) is large where G{x) 
is small. Indeed, as /3 ^ 00, p^ becomes a delta function about the x{s) minimizing G{x), 
that is, about the solution(s) to (P4). Now, suppose that is a broad enough class that it 
can approximate any sufficiently peaked distribution. That means that there is a, qg € ^ 
for which KL(p^ || qg) is small for large (3. In such a situation, the qg solving (P5) will be 
highly peaked about the x{s) solving (P4). Accordingly, if we can solve (P5) for large f3, 
sampling the resultant qg will result in an x with a low E(g \ x,'^^). 

3.2 Review of Delayed Sampling 

We now present a review of conventional, delayed-sampling PC. In this type of PC we 
exploit characteristics of the parametrization of qg, and pursue the algebraic solution of 
(P5) as far as possible, in closed form. At some point, if there remain quantities in this 
algebraic expression that we cannot evaluate closed-form, we estimate them using Monte 
Carlo sampling. 

As an example, consider a noise-free oracle, and instead of pq KuUback-Leibler distance, 
choose ^<^{qg) to be the expected value returned by the oracle under qg, 



^qe,'^{9) = J dxdg gP{g \ x,^)qg{x) 

' dxG{x)qg{x) (11) 



where the second equality reflects the fact that we are assuming a noise-free oracle. To 
emphasize the fact that we're considering noise- free oraclesj^ we will sometimes write 
1^98 ,^(5) = ^qg,G{g)- While Eqg^cg{g) is a linear function of qg, in general it will not be 
a linear function of 6. Accordingly, finding the 9 minimizing ¥^qg^cg{g) may be a non-trivial 
optimization problem. 

Since qg must be a probability distribution, (P5) is actually a constrained optimization 
problem, involving \X\ inequality constraints {qg { x) > Vx}, and one equality constraint. 



/ dx qg{x) = 1. As discussed by Wolpert et al. (20061, such a constrained optimization 



problem can be converted into one with no inequality constraints by the use of barrier 
function methods. These methods transform the original optimization problem into a 
sequence of new optimization problems, {(P5)'}, each of which is easier to solve than the 
original problem (P5). Solving those problems in sequence leads to a solution to the original 
problem (P5). 

Consider applying this method with an entropic barrier for the case where ^'^{qg) = 
¥.qg^c{g)- Then, it turns out that up to additive constants, each problem (P5)* is again 



5. Even though it is noise-free, the oracle G may be a random variable — we may not know G, and may 
attempt to predict it probabilistically from data, in a Bayesian fashion. In such a situation, notation 
like 'E(l^)' refers to the expected oracle under our prior distribution over oracles. So, we use £5^^(3(3) 
rather than Eqg(G'), even though the latter is the notation we used in previous work on PC. 
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of the form of (P5). However the .'^•^{qe) of each problem (P5)* is the 'gp' KL distance, 
KL(g9 II p^^), where Pi is the value of the 'barrier parameter' specifying problem (P5)^ In 
other words, up to irrelevant additive constants, each (P5)* is the problem of finding the 9 
that minimizes 

where S{.) is conventional Shannon entropjj^ In this case the barrier function method 
directs us to iterate the following process: Solve for the qg that minimizes KL{qg \\ p^"), 
and then update (ii. At the end of this process we will have a local solution to (P5). 

In the case where X is a Cartesian product, we often use distributions parametrized as 
a product distribution, qo ~ J|- qi{xi). Under this parametrization each problem (P5)' can 
be solved by gradient descent, where the gradient components of ^ciSiiqe) are given by 

^•^/f^f^ = E,„G(5|x.)+/3rMnfe(a;.)] + A. 

where the Lagrange parameters enforce normalization of each qi. 

There are many better alternative^ to simple gradient descent for minimizing each 



^G,0i iqe), involving Newton's method, block relaxation, and related techniques (Wolpert 
et al.[ |2006[ ). In all such schemes investigated to date, we need to repeatedly evaluate 



terms like Egg ^G{g \ Xi). Sometimes that evaluation can be done closed form (Macready 



and Wolpert[ |2005[ |2004[ ). In blackbox optimization though, this is not possible. 

Typically, when we cannot evaluate the terms Eq^^cid I Xi) in closed- form we use 
MC to estimate them. Since we have a product distribution, we can generate samples 
of the joint distribution qg{x) by sampling each of the marginals qi{xi) separately. One 
can use those sample x's as queries to the oracle. Then, by appropriately averaging the 
oracle's responses to those queries, one can estimate each term Eq^^cig I Xi) (Wolpert and 



Bieniawski 20041. The product factorization implies that our iterative procedure can be 
performed in a completely decentralized manner, with a separate program controlling each 
component Xi, and communicating only with the oracl^ 

In this scheme, once qg is modified, samples of the oracle that were generated from 
preceding qg\ can no longer be directly used to estimate the terms E^^ 0(5 \ Xi). However, 
there are several 'data-aging' heuristics one can employ to reuse such old data by down- 
weighting it. 

In all these schemes, while we ultimately use Monte Carlo in the PC, it is delayed as 
long as possible in the course of solving (P5). This is the basis for calling this variant of 
PC 'delayed sampling'. 

3.3 Advantages of PC 

The PC transformation can substantially alter the optimization landscape. For a noise- 
free oracle G{x), (P4) reduces to the problem of finding the x that minimizes G{x). In 
contrast, (P5) is the problem of finding the 9 that minimizes ,!^r^{9). The characteristics 

6. This qp distance is just tlie free energy of qg for Hamiltonian function G and inverse temperature Pi. 
This gives a novel derivation of the physics injunction to minimize the free energy of a system. 

7. One of these alternatives can be cast as a corrected version of the replicator dynamics of evolution- 



ary game theory (Wolpert 2004b I. This may have interesting implications for GAs, which presume 
evolutionary processes. 

Each such program may be thought of as an 'agent' who updates his probability distribution, and this 
'collective' of agents performs optimization in a decentralized manner. This led to the name 'Probability 
Collective'. 
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of the problems of minimizing G{x) and minimizing .^c^(9) can be vastly different. For 
example, suppose qg is log-concave in its parameters, and is pq KL distance. In this 
case, regardless of the function G{x), ^/^{O) is a convex function of 6, over =S. So the 
PC transformation converts a problem with potentially many local minima into a problem 
th none. See Wolpert et al. ( 2006[ ) for a discussion of the geometry of the surface 



wi 



Since it works directly on distributions, PC can handle arbitrary data types. X can be 
categorical, real-valued, integer- valued, or a mixture of all of these, but in each case, the 
distribution over X is parametrized by a vector of real numbers. This means that all such 
problems 'look the same' to much of the mathematics of PC. Moreover, PC can exploit 
extremely well-understood techniques (like gradient descent) for optimization of continuous 
functions of real-valued vectors, and apply them to problems in these arbitrary spaces. 

Optimizing over distributions can give sensitivity information: The distribution qg pro- 
duced in PC will typically be tightly peaked along certain directions, while being relatively 
flat along other directions. This tells us the relative importance of getting the value of x 
along those different directions precisely correct. 

We can set the initial distribution for PC to be a sum of broad peaks, each centered on a 
solution produced by some other optimization algorithm. Then, as that initial distribution 
gets updated in the PC algorithm, the set of solutions provided by those other optimization 
algorithms are in essence combined, to produce a solution that should be superior to any 
of them individually. 

Yet another advantage to optimizing a distribution is that a distribution can easily 
provide multiple solutions to the optimization problem, potentially far apart in X. Those 
solutions can then be compared by the analyst in a qualitative fashion. 

As discussed later, there are other advantages that accrue specifically if one uses the 
immediate-sampling variant of PC. These include the ability to reuse all old data, the ability 
to exploit prior knowledge concerning the oracle, and the ability to leverage PL techniques. 



See Wolpert et al. (2006) for a discussion of other advantages of PC, in particular in the 



context of distributed control. 



3.4 Relation to Other Work 



PC is related to several other optimization techniques. Consider, for instance. Response 

In these techniques, one uses Design of Ex- 



Surface Models (RSM)s Jones et al. (1998) 



Then, a low-order 
Optimization of 



periments (DOE) to evaluate the objective function at a set of points 
parametric function, often a quadratic, is fitted to these function values 
this 'response surface' or 'surrogate model' is considered trivial compared to the original 
optimization. The result of this surrogate optimization is then used to get more samples 
of the true objective at a different set of points. This procedure is then iterated using 
some heuristics, often in conjunction with trust-region methods to ensure validity of the 
low-order approximation. We note the similarities with PC in Table [2j 

As another example, some variants of PC exploit MC techniques as discussed above, 
and thus stochastically generate populations of samples. In their use of random populations 



these variants of PC are similar to simulated annealing (Kirkpatrick et al. 1983 1, and even 



more so to techniques like EDA's Larraaga and Lozano (2001); De Bonet et al. (1997); 



Lozano et al. (2005) and the CE method (Rubinstein and Kroese 2004). However, these 



other approaches do not explicitly pursue the optimization of the underlying distribution 
qg, as in (P5). Accordingly, those approaches cannot exploit situations in which (P5) can be 
solved without using a stochastically generated population ( Macready and Wolpert[ 2005 



2004). See Macready and Wolpert (2005) for a more extensive discussion of the relation of 



PC to other techniques. 
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XVOiVl 


PP 


Fit parametric function to 
objective function values 


Fit parametric distribution to 
target distribution 


Heuristics to grow trust region 


Cross-validation for regularization 


DOE for sample points 


Random sampling for sample points 


Axis alignment of stencil matters 


Parametrization can address axis alignment 


Surrogate minimization not always easy 


Implicit, probabilistic 'minimization' of surrogate 



Table 2: Relation to RSM. 



4. Immediate sampling 

This section introduces a new PC technique called immediate sampling, and cursorily 
compares it to delayed sampling. As we have just described, in delayed sampling, we use 
algebra for as long as possible in our solution of (P5). When closed-form expressions can 
no longer be evaluated, we resort to MC techniques. In immediate sampling, we form 
an MC sample immediately, rather than delaying it as long as possible. That sample gives 
us an approximation to our objective, ^<-^{9) for all 6 G ^. We then search for the 9 that 
optimizes that sample-based approximate objective. 



4.1 The General Immediate-sampling Algorithm 

We begin with an illustrative example. Consider an integral transform ^cf{qe), and use 
importance sampling to rewrite it as 

dx h^{x)rp(^gl^^c^)^h^{e). (13) 



where we call h^{x) the sampling distribution. Note that the r in Eq. 13 differs from 
the one defined in Eq. 10 as indicated by the extra subscript. This new r is used in the 
next section. 

We form a sample set of N pairs = {x'',g^} by IID sampling the distribution 
h^{x)P{g I x,'i^) in the integrand of Eq. 12 That sampling is the 'immediate' Monte Carlo 
process. is equivalent to a set of N sample functions 

In the simplest version of immediate sampling, we would now use the functions r^i (x*, 9), 
together with our prior knowledge (if any), to estimate the 9 that minimizes ^c^{qg). As 
an example, not using any prior knowledge, we could estimate ^<^{qg) for any 9 as 

dx/ii(x)rp(,|,,^),,i(0) « T.^rU^\Q) ^ (^4) 

This estimate is both an unbiased estimate of ^<^{qe) and the maximum likelihood estimate 
of ,:^<^{qo). Moreover, it has these attributes for all 9. (This is the advantage of estimating 
■^<^{qg) using importance sampling with a proposal distribution h that doesn't vary with 
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9. ) Accordingly, to estimate the 9 that minimizes ^(^{qe) we could simply search for the 
9 that minimizes J2i 

Once again, even though the average in Eq. 14 is an unbiased estimate of ^rg[qg) for any 



fixed 9, its minimizer is not an unbiased estimate of mine^c^(ge). This is because searching 
for the minimizing 9 introduces bias. Therefore, one should use some other technique than 



directly minimizing the righthand side of Eq. 14 to estimate argmineJ^c^(ge) 



4.2 Immediate Sampling with Multiple Sample Sets 

In general, we will not end the algorithm after forming a single sample set D^. Instead we 
will use a map 77 that takes to a new sampling distribution, h? . We then generate new 
{x^g) pairs using h? , giving us a new sample set . We then iterate this process until we 
decide to end the algorithm, at which point we use all our samples sets together to estimate 
argmine^c^(ge)- 

To illustrate this we first present an example of a ^-estimation procedure we could run 
at the end of the immediate sampling algorithm. This example is just the extension of the 
maximum likelihood ^-estimation procedure introduced above to accommodate multiple 
sample sets. Let be the total number of samples, drawn from M sample sets, with 
Nj samples in the j'th sample set. Let y be sample distribution for the j'th sample set, 
and r*'^ the sample function value for the i'th element of the j'th sample set. Also define 



is an unbiased estimate of 



= : I ^ l,...Nj}. Then E2i (^^^'^ 

■^<^{qe) for any sample set j. Accordingly, any weighted average of these estimates is an 
unbiased estimate of .'^ 



^wiqe) 



M 



i=l 



(15) 



Modulo unbiasedness concerns, we could then use the minimizer of Eq |15| as our estimate 
of a.TgToine-^'^iqe)- 

Say we have fixed on some such (^-estimation procedure to run at the end of the algo- 
rithm. The final step of each iteration of immediate sampling is to run 77, the map taking the 
samples generated so far to a new h. Ideally, we want to use the rj that, when repeatedly run 
during the algorithm, maximizes the expected accuracy of the final ^-estimation. However 
even for a simple 0-estimation procedure, determining this optimal i] can be quite difficult. 
As discussed later, it is identical to the active learning problem in machine learning. 

In this paper we adopt a two-step heuristic for setting rj. In the first step, at the end 
of each iteration, we estimate the optimal qg based on all the sample sets generated so 



far, using Eq. 15 In the second step, we complete rj by setting the new h to the current 
estimate of the optimal qg. At that point, the new h is used to generate a new sample set, 
and the process repeats. 



4.3 Immediate Sampling with MCMC 

For certain types of it is possible to form samples using other sampling methods 



hke Markov Chain Monte Carlo (MCMC) (see Mackay 2003 Bernardo and Smith 2000 



Jdx F{G(x),qg(x)) 
J dx qg{x) 



In Eq. 
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we fix the denominator 



Since qe is normalized, so is J dx F{G{x) 

integral to 1. In practice though, it may make sense to replace both of the Inte grals in this ratio with 
importance sample estimates of them. That means dividing the sum in Eq. 
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_by j:^,q{x^)/h(x^) and 
then finding the 9 that optimizes that ratio of sums, rather than the 9 that just optimizes the numerator 



term (see Robert and Casella 20041. For example, this can be helpful when one uses cross-validation to 
set /3, as described below. 
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Berger 1985 1. For example, if ^'^{9) is pq distance from the Boltzmann distribution to 



ge, then we can use MCMC to form a sample set of p^^ (not of qe). We can then use that 
sample set to form an unbiased estimate of ^£^(6*) for any 9. But if j3 were to change, these 
old samples cannot be used directly. One would have to resort to additional techniques 
like rejection sampling in order to reuse these samples. The advantage of using importance 
sampling is that all previous samples can be reused by the simple expedient of modifying 
their likelihood ratios. Therefore, in this paper, we only consider sampling distributions h 
that can be sampled directly, without any need for techniques like MCMC. 



4.4 Advantages of Immediate- Sampling PC 

In contrast to delayed sampling, immediate sampling usually presents no difficulty with 
reusing old data, as shown above; all (x''-', g*'-') pairs can be used directly. Note that we 
can also reuse data that was generated when F was different, for instance, data generated 
under a differerent /3* during a KL distance minimization procedure. As long as we store 
h^(x^'^) in addition to 5*'^ and x^'^ for every sample, we can always evaluate r^'| (x*'-', 9) for 
any F. 

Indeed, we can even comment on optimal ways of reusing this old data. Since each 
r^f^] {x^''^ ,9) is an unbiased estimate of the integral ^c^(9), any weighted average of the 



rj^j{x^'-',9)'s is also an unbiased estimate. This can be exploited in the 6'-estimation pro- 



r '-^ 

cedure. For instance, consider the estimator of Eq. 15 
variances of the individual r^f (a;*'-', 9), we can weight the terms r^'^^ (x*'^ , 9) to minimize the 
variance of the associated weighted average estimator. Those weights are proportional to 



If we have good estimates of the 



the inverses of the variances (see Macready and Wolpert, 2005: Lepage 1978, 1980) . As 
discussed in Sec. |2.3| the accuracy of the associated MCO algorithm could be expected to 
improve under such weighting. 

We can also shed light on how to go about gathering new data. As in the VEGAS 



Algorithm (Lepage 1978 19801, one could incorporate bias- variance considerations into 
the operator rj that sets the next sampling distribution. To give an example, let S be 
the range of 77, and fix 9. Given S and 9, one can ask what proposal distribution /i G S 
would minimize the sample variance of the estimator in Eq. |14[ Intuitively, this is akin to 
asking how best to do active learning. In general, the answer to this question, the optimal 
sampling distribution h{x), will be set by the function rp(g\x,w),hi9), viewed as mapping 
AT ^ M. Accordingly, for any fixed 9, one can use the MC samples generated so far to 
estimate the x-dependence of ?'p(g|a;,^),;i(^)5 smd thereby estimate the optimal /i G S. One 
then uses that estimate as the next sampling distribution h. 

Another advantage of immediate sampling over delayed sampling is that the analysis in 
delayed sampling relies crucially on the parametrization of the (7's; some such parametriza- 
tions will permit the closed-form calculations of delayed sampling, and others will not. In 
immediate sampling, this problem disappears. 



4.5 Implications of the Identity Between MCO and PL 

For the case where ,^^{9) is an integral transform like Eq. |9] the PC optimization problem 
(P5) becomes a special case of minimizing a parametrized integral, the problem (P2). 
Formally, the equivalence is made by equating x with the parameter 0, g with w, and 
g X P{g I x,'^^) with U{w, </)). In particular, immediate sampling is a special case of MCO. 
This identity means that we can exploit the extremely well-researched field of PL to improve 
many aspects of immediate sampling. In particular: 



PL techniques like boosting (Schapire and Singer 1999) and bagging (Breiman 19941 can be 



used in (re)using old samples before forming new ones. 
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• Variants of active learninj^can be used to set and update h. Some aspect of this are discussed 
in Sec. [H below. 

• Cross-validation is directly applicable in many ways: In our context, the curse of dimen- 
sionality arises if =S is very large. We can address this the conventional PL way, by adding 
a regularization function of qg to the objective function. The parameters controlling this 
regularization can be updated dynamically, as new data is generated, using cross-validation 

To use cross-validation this way, one forms multiple partitions of the current data. For each 
such partition, one calculates the optimal qg for the training subset of that partition. One 
then examines error on the validation subset of that partition. More precisely, one calculates 
the unregularized objective value on the held-out data. 

• More generally, we can use cross-validation to dynamically update any parameters of the 
immediate sampling algorithm. For example, we can update the 'temperature' parameter (3 
of the Boltzmann distribution, arising in both qp and pq KL distance, this way. 

Note that doing this does not involve making more calls to the oracle. 

• We can also use cross-validation to choose the best model class (parametrization) for qg, among 
several candidates. 



As an alternative to all these uses of cross-validation, one can use stacking to dynamically 
combine differrent temperatures, different parametrized density functions, and so on. 



One may also be able to apply kernel methods to do the density estimation (see Macready 
20051. 



5. Experiments 

In this section, we demonstrate the application of PL and immediate-sampling PC tech- 
niques to the unconstrained optimization of continuous functions, both deterministic and 
nondeterministic. We first describe our choice of J^'g?, in this case pq KL distance. Next, 
as an illustrative example, we apply immediate sampling to the simplest of optimization 
problems, where the objective is a 2-D quadratic. Subsequently, we apply it to determinis- 
tic and stochastic versions of two well-known unconstrained optimization benchmarks, the 
Rosenbrock function and the Woods function. 

We highlight the use of PL techniques to enhance optimizer performance on these 
benchmark problems. In particular, we show that cross-validation for regularization yields 
a performance improvement of an order of magnitude. We then show that cross-validation 
for model-selection results in improved performance, especially in the early stages of the 
algorithm. We also show that bagging can yield significant improvements in performance. 

5.1 Minimizing pq KL Distance 

Recall that the integral form of pq KL distance is 

KL(p|lg) = j dxp{x)\n[P-^^ . 

It is easy to show that when there are no restrictions on q being a parametrized density, 
pq KL distance is minimized li p — q. However, owing to sampling considerations, we 



10. Active learning in the precise machine learning sense uses current data to decide on a new query x to 
feed to the oracle. We use the term more loosely here, to refer to any scheme for using current data to 
dynamically modify a process for generating for future queries. 
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usually choose q to be some parametric distribution qg. In this case, we want to find the 
parameter vector 9 that minimizes KL(p||(7£/). Since the target distribution p is derived 
purely from and is independent of qg, minimizing pq KL distance is equivalent to the 
following cross-entropy minimization problem. 

minimize — J dxp{x) In {q{x)) , 

subject to Jdxq{x) — 1, (16) 
q{x) > Vx. 

5.1.1 Gaussian Densities 



If q is log-concave in its parameters 9, the minimization problem ( 16 1 is a convex opti- 
mization problem. In particular, consider the case where X — M", and qg is a multivariate 
Gaussian density, with mean fi and covariance S, parametrized as follows, 

I^M-) - (2,)„/2|s|i/2^-P [ 2 

then the optimal parameters are given by matching first and second moments of p and qg . 

fj* = J dxxp{x), 

T.* = J dx{x- fi*)ix- fi*yp{x). 

It is easy to generalize this to the case where X C M", by making a suitable modification 
to the definition of p. This is described in Sec. |5.2.1[ 

5.1.2 Immediate Sampling with a Single Gaussian 

Using importance sampling, we can convert the cross-entropy integral in Eq. [T6]to a sum 
over data points, as follows. 

where D is the data set {(a;*, g^)},i — 1, . . . , N . This sets up the minimization problem for 
immediate sampling for pq KL distance. 

minimize — — In . (17) 

J-) "^l-^ ) 



Denote the likelihood ratios by = p{x^) / h{x^) . Differentiating Eq. 17 with respect to the 
parameters fi and and setting them to zero yieldi^ 



M = 



Y^^s\x^~^Ji*){x^~^^*)^ 



II. Remarks: 

1. As expected, these formulae, in the infinite-data limit, are identical to the moment-matching results 
for the full-blown integral case. 

2. The formulffi resemble those for MAP density estimation, often used in supervised learning to find 
the MAP parameters of a distribution from a set of samples. The difference in this case is that each 
sample point is weighted by the likelihood ratio s*, and is equivalent to 'converting' samples from h 
to samples from p. 
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5.1.3 Mixture Models 



The single Gaussian is a fairly restrictive class of models. Mixture models can significantly 
improve flexibility, but at the cost of convexity of the KL distance minimization problem. 
However, a plethora of techniques from supervized learning, in particular the Expectation 
Maximization (EM) algorithm, can be applied with minor modifications. 

Suppose qg is a mixture of M Gaussians, that is, 9 — S, (f)) where is the mixing 
p.m.f, we can view the problem as one where a hidden variable z decides which mixture 
component each sample is drawn from. We then have the optimization problem 



mmimize — > . ; /, In (^qg{x\ z')) 



Following the standard EM procedure, we multiply and divide the quantity inside the 
logarithm by some Qi{z^), where Qi is a distribution over the possible values of z*. As 
before, let be the likelihood ratio of the i'th sample. 



minimize — In j Qi{z^) - 

D \ 



•e{x\z^) 



Then using Jensen's inequality, we can take Qi outside the logarithm to get a lower bound. 
To make this lower bound tight, choose Qi{z'^) to be the constant p(z*|a;*). Finally, differ- 
entiating with respect to fij,Ti~^ and 4>j gives us the EM-like algorithm: 

E-step: For each i, set Qi{z^) = 

that is, w] = jIx'), j = l,...,M. 



M-step: Set (ij 



D "'r 



i 



Since this is a nonconvex problem, one typically runs the algorithm multiple times with 
random initializations of the parameters. 



5.2 Implementation Details 

In this section we describe the implementation details of an iterative immediate-sampling 
PC algorithm that uses the Gaussian mixture models described in the previous section to 
minimize pq KL distance to a Boltzmann target parametrized by p. We also describe the 
modification of a variety of techniques from parmetric learning that significantly improve 
performance of this algorithm. An overview of the procedure is presented in Algorithm [l] 

5.2.1 Example: Quadratic G{x) 

Consider the 2-D box X = {x dM.'^ \ \\x\\oc < 1}. Consider a simple quadratic on X, 

Gq{x) = xl+ x\ + XiX2, X <E X. 

The surface and contours of this simple quadratic on X are shown in Fig. [T] Also shown 
are the corresponding Boltzmann target distributions on X, for /3 = 2, 10 and 50. As 



20 



Algorithm 1 Overview of pq minimization using Gaussian mixtures 
1: Draw uniform random samples on X 
2: Initialize regularization parameter /? 
3: Compute G{x) values for those samples 
4: repeat 

5: Find a mixture distribution qg to minimize sampled pq KL distance 
6: Sample from qg 
7: Compute G{x) for those samples 
8: Update j3 
9: until Termination 
10: Sample final qg to get solution(s). 



can be seen, as (3 increases, places increasing probability mass near the optimum of 
G{x), leading to progressively lower ¥.pf3G{x). Also note that since G{x) is a quadratic, 
p^{x) oc exp{—(3G{x)) is a Gaussian, restricted to X and renormalized. We now 'fit' a 
Gaussian density qg to the Boltzmann p^ by minimizing KL(p''||(7e), for a sequence of 
increasing values of f3. Note that qg is a distribution over K^, and Gq is not defined 
everywhere in M^. Therefore, we extend the definition of Gq to all of as follows. 



XiX2, 



xex. 

otherwise. 



Now p^ = for all X ^ X, and the integral for KL distance can be reduced to an integral 
over X. This means that samples outside X are not considered in our computations. 



5.2.2 Constant /? 

First, we fix /3 = 5, and run a few iterations of the PC 
algorithm. To start with, we draw Nj = 30 samples from 
the uniform distribution on X. The best- fit Gaussian is 
computed using the immediate sampling procedure out- 
lined in the preceding section. At each successive itera- 
tion, Nj = 30 more samples are drawn from the current qg 
and the algorithm proceeds. A total of 6 such iterations 
are performed. The 90% confidence ellipsoids correspond- 
ing to p^ (heavy line) and the iterates of qg (thin line) are 
shown in Fig. |2] Also shown are the corresponding values 
of EiqgG{x), computed using the sample mean of Gq{x) 
for 1000 samples of x drawn from each qg, and ¥Jj{p^\\qg), 
computed as the sample mean of \ii{p^ {x) / qg{xy) for 1000 
samples of x drawn according to p^ . 



5.2.3 Varying /3 




Figure 1: Quadratic G[x) 
and associated 
Gaussian targets 



Next, we change (3 between iterations, in the 'update 
step shown in algorithm ([T]). With the same algorithm pa- 
rameters, we start with (3 — 10, and at each iteration, we 

use a multiplicative update rule (3 <— for some constant fc^ > 1, in this case, 1.5. As 
the algorithm progresses, the increasing (3 causes the target density p^ to place increasing 
probability mass on regions with low G{x), as shown in Fig. [l] Since the distributions qg 
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Figure 2: PC iterations for quadratic G{x). 



are best-fits to p, successive iterations will generate lower EggG(x). The 90% confidence 
ellipsoids and evolution of EqgG{x) and KL distance are shown in Fig. [2] 

5.2.4 Cross-validation to Schedule /? 

In more complex problems, it may be difficult to find a good value for the f3 update ratio 
kf^. However, we note that the objective KL(p^||(je) can be viewed as a regularized version 
of the original objective, Eg^, [G(a;)]. Therefore, we use the standard PL technique of cross- 
validation to pick the regularization parameter (3 from some set {/?}. At each iteration, 
we partition the data set D into training and test data sets Dt and Dy- Then, for each 
(3 G {/3}, we find the optimal parameters 0*{P) using only the training data Dt- Next, we 
test the associated qe*(i3) on the test data Dy using the following performance measure. 



E 



qe{x')G{x') 



W) 



E 



h{x^) 



(18) 
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The objective g{6) is an estimatcj^ of the unregularized objective £^^[0(2:)]. Finally, we 
set /3* = argmin^g{^} 5(6'*(/3)), and compute 9*{f3*) using all the data D. Note that the 
whole cross-validation procedure is carried out without any more calls to the oracle ^. 

We demonstrate the functioning of cross-validation on the well-known Rosenbrock prob- 
lem in two dimensions, given by 

Gr{x) ^ 100(2:2 - x\f + (1 - Xif, 

over the region 

X = {a; e I ii^ii^ ^ rj,^^ 

imum value of is achieved at x = (1, 1). 
The details of the cross-validation algorithm used are presented in Algorithm [2j For this 

Algorithm 2 Cross-validation for (3. 

Initialize interval extension count extlter = 0, and maxExtlter and /Jo- 
repeat 

Ax 13 = Po, consider the interval A/3 = [kiPo, k2Po]- 
Choose 1/3} be a set of equally-spaced points in A/3. 
Partition the data into K random disjoint subsets, 
for each fold k, do 

Training data is the union of all but the /c*^ data partitions. 

Test data is the k*^ partition. 

for /3i in {/?}, do 

Use training data to compute optimal parameters 6*{Pi, Dt^^). 
Use test data to compute held-out performance g{6*{(3i, Dy^)), from Eq. 

end for 
end for 

Compute average held-out performance, g{(3i), oig{9*{Pi,Dvf,))- 
Fit a quadratic Q{P) in a least-squares sense to the data {I3i,g{l3i)). 
if Q is convex then 

Set optimum regularization parameter /3* = argmin^gA/3 Q(/5)- 
else 

Fit a line L(I3) in a least-squares sense to the data {(3i,g{f3i)). 
Choose P* = argmin^gA^ L(/3). 
end if 

Increment extlter 
Update /3o ^ (3* 
until extlter > maxExtlter or Q is convex. 
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experiment, we choose 



maxExtlter = 4, 

N,^ 10, 



fci = 0.5, 

12(3 = 5, 



fc2 = 

K = 



2, 
10. 



The histories of EqG{x) and (3 are shown in Fig. |3] Also shown are plots of the fitted 
Q{(3) at iterations 8 and 15. As can be seen, the value of (3 sometimes decreases from one 

12. The reason for dividing by the sum of q{x^) / h{x^) is as follows. If the training data is such that no 
probability mass is placed on the test data, the numerator of g,^ is 0, regardless of the param eters of 
qg. In order to avoid this peculiar problem, we divide by the sum of q{x^)/h{x'), as desribed by 
and CasellalpOOil). 



Robert 



23 



CrosE-validatior for lQg[E(G)] Mietary. 



Cross-validation for p: log{^) hlistorv- 
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(a) EqG{x) history. 




10 15 20 

iteration 



(b) P history. 



CrosE-validatior for [i: E(G) on Held-out Set 



CroES-vslidatior for |^: E(G] on hiald-out Set 





(c) Fit Q{I3), iteration 8. 



(d) Fit QiP), iteration 15. 



Figure 3: Cross-validation for /?: 2-D Rosenbrock G{x). 



iteration to the next, which can never happen in any fixed multiplicative update scheme. 

We now demonstrate the need for an automated regularization scheme, on another 
well-known test problem in M"*, the Woods problem, given by 



G 



woods 



100(X2 - Xi)^ + Xif + 90(X4 - xjf + (1 - ^3)2 

+ 10.1[(1 - X2f + (1 - X^y-] + 19.8(1 - X2){1 - X4). 



The optimum value of is achieved atx=(l,l,l,l). We run the PC algorithm 50 times 
with cross-validation for regularization. For this experiment, we used a single Gaussian q, 
and set 

maxExtlter =4, fci = 0.5, k2 = 3, 

Nj = 20, np= 5, K ^ 10. 

From these results, we then attempt to find the best-fit multiplicative update rule for 
(3, only to find that the average /3 variation is not at all well-approximated by any fixed 
update P <— This poor fit is shown in Fig. |4] where we show a least-squares fit to 

both f3 and log(/3). In the fit to log(/3) the final (3 is off by over 100%, and in the fit to 
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(a) Least-squares fit to /3: 



(b) Least-squares fit to log(/3): 



Figure 4: Best-fit /? update rule. 



/3, the initial /? is off by several orders of magnitude. We then compare the performance 
of cross-validation to that of PC algorithm using the fixed update rule derived from the 
best least-squares fit to log(/3). From a comparison over 50 runs, we see that using this 
best-fit update rule performs extremely poorly - cross-validation yields an improvement in 
final EqgG{x) by over an order of magnitude, as shown in Fig. [s] 



Cross-validation for Roguiarization: Woods Problem. 




20 3D 
Iteration 



(a) log(EqG) history. 



Figure 5: Cross-validation beats best-fit fixed (3 update: 4-D Woods G{x). 



5.2.5 Bagging 

While regularization is a method to decrease bias, bagging is a well-known variance-reducing 
technique. Bagging is easily incorporated in our algorithm. Suppose, at some stage in the 
algorithm, we have N samples we resample our existing data set exactly N times 

with replacement. This gives us a different set of data set D', which also contains some 
duplicates. We compute optimal parameters 6*{D'). We repeat this resampling process fc^ 
times and uniformly average the resulting optimal densities qe*(D'f^)i ^ = 1, • ■ • , ^6- 
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We demonstrate this procedure, using the Rosenbrock function and a single Gaussian 
qg. In this experiment, we also demonstrate the ability of PC to handle non-deterministic 
oracles by adding uniform random noise to every function evaluation, that is, {g \ x,G) ^ 
U[—0.25, 0.25]. For this experiment, Nj — 20, fcf, = 5. The /? update is performed using the 
same cross-validation algorithm described above. Fig. |6]shows the results of 50 runs of the 
PC algorithm with and without bagging. We see that bagging finds better solutions, and 




Figure 6: Bagging improves performance: Noisy 2-D Rosenbrock. 

moreover, it reduces the variance between runs. Note that the way we use bagging, we are 
only assured of improved variance for a single MC estimation at a given 9, and not over 
the whole MCO process of searching over 9. 

5.2.6 Cross-validation for Regularization and Model Selection 

In many problems like the Rosenbrock, a single Gaussian is a poor fit to for many values 
of (3. In these cases, we can use a mixture of Gaussians to obtain a better fit to p^. We 
now describe the use of cross-validation to pick the number of components in the mixture 
model. We use an algorithm very similar to the one described for regularization. In these 
experiments, we use a greedy algorithm to search over the joint space of /3 and models: 

1. We first pick the regularization parameter using Algorithm [2] 

2. For that (3, we use Algorithm [3] to pick the number of mixture components. 

For this experiment, the details are the same as the preceding section, but without bagging. 
The set of models {{0}} is the set of Gaussian mixtures with one, two or three mixing 
components. Fig.|7]shows the variation of Eg(G') vs. iteration. The mixture model is much 
quicker to yield lower expected G, because the Boltzmann at many values of (3 is better 
approximated by a mixture of Gaussians. However, note that the mixture models performs 
poorly towards the end of the run. The reason for this is as follows: No shape regularization 
was used during the EM procedure. This means that the algorithm often samples from 
nearly degenerate Gaussians. These 'strange' sample sets hurt the subsequent performance 
of importance sampling, and hence of the associated MCO problem. This can be alleviated 
by using some form of shape regularization in the EM algorithm. 



26 



Algorithm 3 Cross-validation for model selection. 



Initialize set {{</>}} of model classes {(j)} to search over. 
Partition the data into K disjoint subsets, 
for each fold k, do 

Training data is all but the k^^ data partitions. 
Test data is the k^^ data partition, 
for in {{(/)}} do 

Compute the optimal parameter set 6*{Dt,.) G 
Compute held-out performance g{9* (Dv^)) 
end for 

Compute the sample held-out performance, g{{(pi}), from Eq. 18 
end for 

Choose best model class {</>*} = argmin,^. g{{(j)i}). 



Cross-validation for Modal-selsction: 2-D Rosenbrocl^. 
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Figure 7: Cross-validation for regularization and model-selection: 2-D penalty function 
G{x). 



6. Fit-based Monte Carlo 

Thus far, wc have not exploited the locations of the samples in constructing esimates. In 
this section, we discuss the incorporation of sample locations to improve both MC and 
MCO. 



6.1 Fit-based MC Estimation of Integrals 



We first consider MC estimation of an integral, presented at the beginning of Sec. 2.3 
Recall from that discussion, that to accord with MCO notation, we write the integral to 
be estimated as -Sf (0) — J dw U{w, (p) for some fixed 0. In this notation the sampling of v 
provides a sample set {{w^, U{w^, (f))) -.1 = 1,... N). The associated sum ^{{w\u (w\<j)))} = 
^{4>) then serves as our estimate of the integral ^ = ^{(j>). 

In forming the estimate ^{(w' ,u {w\4>))} we do not exploit the relationships between the 
locations of the sample points and the associated values of the integrand. Indeed, since 
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those locations {w^} do not appear directly in that estimate, ^{{w\u {w\ct>))} is unchanged 
even if those locations changed in such a way that the values {r*} stayed the same. 

The idea behind fit-based (FB) Monte Carlo is to leverage the location data to replace 
■^{{w\u{w\(t>))} with a more accurate estimate of The most straightforward FB MC 
treats the sample pairs U{w'^, (f>)) : i — 1, . . . N)} as a training set for a supervised- 

learning algorithm. Running such an algorithm produces a fit U{.,(j)) taking w's into M. 
This fit is an estimate of the actual oracle U{,(j)), and this fit defines an estimate for the 
full integral, 



•^{u,', [/(«;•, 0)} = J dwU{w,(t)) (19) 

We will sometimes omit the subscript and just write ^{(j)) or even just .5f . In this most 
straightforward version of FB MC, we use ^ as our estimate of ^ rather than ^ . 

In some circumstances one can evaluate ^ in closed form. A recent paper reviewing 
some work on how to do this with Gaussian processes is given by |Rasmussen and Gharamani| 



(2003). In other circumstances one can form low-order approximations to .if, for example 



using Laplace approximations (see Robert and Casella 2004). Alternatively, conventional 



deterministic grid-based approximation of the integral ^ can be cast as a degenerate 
version of closed- form fit-based estimatioiJH] of J^. 

More generally, one can form an approximation to the integral J^{<j}) by MC sampling 
of U{.,(f)). Generating these fictitious samples of U{.,(t>) does not incur the expense of 
calling the actual oracle U{.,(j>). So, in this approach, we run MC twice. The first time, 
we generate the factual samples {(w*, U{w^, (f>)) : i — 1, . . . N)}. Given those samples, we 
form the fit to them, U{., (f). We then run a second MC process using the fictitious oracle 
C/(.,0). 

Note that in all of these approaches, the original sampling distribution v does not 
directly arise, that is, the values {^(w*)} do not arise. In particular, if one were to 
change those values without changing the factual sample locations {w'}, then the esti- 
mate -Sf{(iu',c/(ui\0))} would not change. Of course, a different v would result in a different 
sample set, and thereby a different estimate, but given a sample set, the sampling distribu- 
tion is immaterial. This is typically the case with FB MC estimators, and it contrasts with 
the estimator -^{(w^u (w^ ,<)>))} i which would change if v were changed without changing the 
factual sample locations. 

Note also that the factual samples underlying the fit C/(.,0) are exact samples of the 
factual oracle, U{.,(j)). In contrast, since in general U{.,(t)) ^ U{.,(j)), the fictitious samples 
will be erroneous, if viewed as samples of U{., (j)). Since we are ultimately concerned with an 
integral of [/(., (/i), this suggests that the fictitious samples should be weighted less than the 
factual samples. This might be the case even if one had infinitely many fictitious samples. 
In fact, even if one could evaluate in closed form, it might make sense not to use it 
directly as our final estimate of ^ . Instead, combining it with the importance-sampling 
estimate might improve the estimate. 



13. To see this, modify the MC process to be sampling without replacement. Choose the proposal distribution 
V for this process to be a sum of delta functions. The centers of those delta functions give the points on 
a regular grid of points in the space of allowed a;'s. Have the number of samples equal the number of 
such grid points. Finally, have our fit to the samples, U{.,(f)), be a sum of step- wise constant functions, 
going through the sample points. The closed form integral of that fit given by Eq.[l9]is just the Reimann 
approximation to the original integral, J dw U{w,cf)). 
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6.2 Bayesian Fit-based MCO 

We now extend the discussion to MCO by allowing (j> to vary. As with the case of FB MC 
estimation of an integral, the most straightforward version of FB MCO uses ^{(f>) rather 
than as our estimate of ^ {(/))■ This means that we use argmin0[^((/))] rather than 

aigmm^[j!?{(j))] as our estimate of the (j> optimizing .5f(0). 

^{(f>) is a sum, whereas J^((/>) is an integral. This means that different algorithms are 
required to find the optimizing them. Indeed, optimizing J^{(l)) is formally the same 
type of problem as optimizing ^(0); both functions of 4> are parametrized integrals over 
w. So if needed, we can use PLMCO techniques to optimize Again, as with MC, the 

integrand of ^{4>) is not the factual oracle. So, minimizing ^{(f) using PLMCO sampling 
techniques will not require making additional calls to the factual oracle. 

Consider a Bayesian approach to forming the fit. Our problem is to solve the MCO 
problem (P2), given that the factual oracle U is not known, and we only can generate 
samples of U . Adopting a fully Bayesian perspective, since U is not known, we must 
treat it as a random variable. So we have a posterior distribution over all possible oracles, 
reflecting all the data we have concerning the factual oracle. We then use that posterior to 
try to solve (P2). 

Say that in the usual way that our data contains the w's and the associated functions 
U{'w, .) of a sample set that was generated by importance samphng U (see Sec. [2|. More 
generally, we may have additional data, for instance, the gradients of U at the sample points. 
For simplicity though, we restrict attention to the case where the provided information is 
only the sample set of functions, {w\r'(.)}, together with v. 

We will use D to refer to a sample set for MC or MCO, and for immediate sampling in 
particular. So we write our posterior over oracle^^as P{Uc \ D). Using this notation, the 
goal in Bayesian FB MCO is to exploit P{Uc \ D) to improve our estimate of the (j) that 
minimizes J dw U{w,(j)). 

How should we use P{Uc \ D) to estimate the solution to (P2)? Bayesian decision 
theory tells us to minimize posterior expected loss, / dUc P{Uc \ D)L{(l),Uc)- Given the 
loss function of Eq. |4| that means we wish to solve 

(P6): min / dUc P{Uc \ D)L{(f>,Ur) = min / dw^dUc P{Uc \ D)Uc{wc,<j))- 

To avoid confusion, the variable of integration is written as Wc, to distinguish it from w's in 
the integral J dw U{w, 0). The solution to (P6) is our best possible guess of the (p solving 
problem (P2), given the sample set D. Finding that solution is a problem of minimizing a 
parametrized integral. 

Sometimes we may be able to solve (P6) in closed form, even when we cannot solve 
(P2) in closed form. Performing the integral over Uc may simplify the remaining integral 
over Wc- More generally, we can address (P6) using MCO techniques, and in particular 
using PLMCO 13 

To solve (P6) with PLMCO one generates fictitious samples by sampling one distri- 
bution over Uc's and one over Wc's. This MC process does not involve calls to the actual 
oracle U, but samples a new distribution over C/c's, to generate counter-factual UcS, and 
then samples those Uc's. 

14. Practically, when running a computer experiment, U is the actual oracle generating D according to a 
likelihood P{D \ U). On the other hand, the posterior P{Uc \ D) reflects both that likelihood and a 
prior P{Uc) assumed by the algorithm. So, Uc is a random variable, whereas U refers to the single true 
factual oracle. 

15. To see that explicitly, rewrite the integral in (P2) as J dz V[z,(j)), and identify values of z with pairs 
(wc, Uc), while taking V{z, (f) = V{wc, Uc, <!>) = P{Uc \ D)Uc(wc, <j}). 
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6.3 Example: Fit-based Immediate Sampling 

To illustrate the foregoing we consider the variant of MCO given by immediate sampling 
with a noise-free oracle. In the simple version of MCO considered just above, the estimate 
we make for (j) has no effect on what points are chosen for any future calls we might make to 
the oracle. For simplicity, we restrict attention to the analogous formulation of immediate 
sampling. Using immediate sampling terminology, this means that we only consider the 
issue of how best to estimate 9 after the immediate sampling algorithm has exhausted all 
its calls to the oracle. We do not consider the more general active learning issue, of how 
best to estimate Q when this estimate will be affect further calls to the oracle. However see 
Sec. 16.61 below. 

Recall that in immediate sampling, identifying Wc with and (p with 9, Uc{wc,4') 
becomes F{Gc{xc),9). Given a sample set D of {x, G{x)) pairs generated from a noise-free 
factual oracle, our Bayesian optimization problem in immediate sampling is to find the 9 
that minimizes 

E{^gM\D) = Ep^GAD)[J dx,F{G,{xc),qe{xc))] 

dx.dG, P{G, I D)F{G,{x,),qe{x,)) 
= .^d{9). (20) 
Contrast this with Eq. |11| In particular, the inner integral in Eq. [20 runs over fictitious 
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G is the factual 



oracles Gc that are generated according to P{Gc \ D), whereas in Eq. 
oracle. 

In some circumstances one can evaluate the integral in Eq. [20] algebraically, to give 
a closed form function of 9. In other cases, we can algebraically evaluate an accurate 
low-order approximation, to again give a closed form function of 9. For the rest of this 
subsection, however, we consider the situation where neither of these possibilities hold. 

To address this situation, we approximate the integral in Eq. 20 using importance 
sampling. However, to do this, we may have to importance-sample over two domains. The 
first sampling is over sample locations Xc, using some sampling distribution he- (As an 
example, we can simply choose = h.) The second sampling is over possible oracles G^, 
using some sampling distribution H . More precisely, write 

E{^GAd)\D) = I dx.dG, [h,{x,)H{Gc)\ , F{G,{xc),qe{xc))- (21) 

To approximate this integral generate Nt locations, {x].}, by sampling he- This gives us 
Nt integrals 

TD,i.^j{e) = J dG, g(Gc) ^^^;j^^^ f(Gc«),g.«)). (22) 

Sometimes, these integrals also can be evaluated algebraically, giving closed form sample 
functions of 9. As an example, suppose we sample oracles according to the posterior, that 
is, take H{Gc) = P{Gc \D),so that 

TDA^^^m = -^-l-y j dG, P{G, I D)F{GAx',),qe{xl)). (23) 



Next, say that we have a Gaussian process prior over oracles, P{Gc) ( [Rasmussen and" 



Williams 2006), with a Gaussian covariance kernel. For this choice, and for some F's, 
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we can compute T^ i^^ij^O) exactly for any 6, provided D is not too large. For example, 
this is the case for most F^s whose dependence on their first argument lies in the expo- 
nential familyp^ In other situations, while we cannot evaluate them exactly, the integrals 
To can be accurately approximated algebraically. This again reduces them to closed 

form sample functions of 0. 

In either of these cases, there is no need to sample H; samples of he suffice. More 
generally, we may not be able to evaluate the Nt functions Tq {2,i}(6') algebraically, and 
also cannot form accurate low-order algebraic approximations to them. In this situation, 
for each x*, we should generate one sample of Gc from i7(Gc)j^ This provides us a total 
of Nt sample functions 

As an example, say we believe firmly that a particular posterior P{Gc \ D) governs our 
problem, and can sample that posterior. Then we can take H{Gc) to equal P{Gc \ D). 



Now Eq. 24 only requires that we have values of our sampled Gc's at the {xl}, that is, 
we only need to have values G'*(x*), where GJ, ^ H{Gc)- So we only need to sample the 
Nt separate one-dimensional distributions {P(Gc(a;^) | D)}. In particular, we might be 
able to use Gaussian Process techniques to generate those values. Alternatively, as a rough 
approximation, one could simply fit a regression to the data in D, u}{x), and then add noise 
to the vector of values {w(a::* )} to get {G* (a;* )}. If we wanted to have multiple Gc's for each 
then we would simply generate more samples of each distribution {P{Gc{x\) \ -D)}j^ 
However we sample H , the resultant sample functions provide the estimate 

■^D.{xl,Gl}i(^) - l^z2^D,{xl,Gi}{&) 
Nt^^ 

« n-^DAK-Gim\D) (25) 

which we sometimes abbreviate as For the situation where we can evaluate the 

integral over Gc algebraically (or at least approximate it that way), we instead define 

^ 2 = 1 

and rely on the context to decide which of the definitions of ^d{(^) is meant. So for either 
case we can write J^d{&) ~ .^d(6>)[^ 



16. Strictly speaking, since the oracle is noise-free, the likelihood P(D \ Gc) is a delta function about 
having D lie exactly on the function Gc{x). In practice, this may make the computation be ill-behaved 
numerically. Typically such problems are addressed modeling the fictitious oracles as though the values 
they returned had a small amount of Gaussian noise added. 

17. One could have the number of samples of H{Gc) not match the number of samples of hc(x). To avoid 
the associated notational overhead, here we just match up the two types of samples, one-to-one. 

18. As an alternative, we could reverse the sampling order and sample P(Gc \ D) first and he second. 
Practically, this would mean generating Nt samples of he, and then sampling the single A^T-dimensional 
distribution P{Gc{xl), . . . Gdx^''^) \ D). (This contrasts to the case considered in the text in which H 
is sampled second, so one instead samples Nt separate one-dimensional distributions {P{Gc{x]?) \ D)}.) 
If we do this using a 'rough approximation' based on a fit to D, the noise values added to the values 
of the fit, [uoix].)}, would have to be correlated with each other, since they refiect the same Gc- 

19. As a practical issue, we may want to divide the sum in Eq. 26 by the empirical average X]iI?L ^h'lt^) ■ 
Similarly, if we cannot evaluate the integral over Gc algebraically, we may want to modify the estimate 
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In both cases, under naive MCO, we search for the that minimizes ,'^d{0). We 
then use that as our estimate for the solution to (P6). More generally, rather than use 
naive MCO we can exploit our sample functions with PLMCO. For example, rather than 

minimizing ^u{9), we could minimize a sum of {•^]j{d) and a regularization penalty term. 

However we arrive at our (estimated) optimal qg, most simplistically, we can update h 
to equal that new qg. In a more sophisticated approach, we could set h from the sample 
functions using active learning (see Sec. |6.6| below). Once we have that new h we can form 
samples of it to generate new factual sample locations x. These in turn are fed to the 
factual oracle G to augment our data set D. Then the process repeats. 

Note that unlike with non-FB immediate sampling, with FB immediate sampling we 
need to evaluate P{Gc \ D) (or sample it, if we choose to have H{Gc) — P{Gc \ D)). 
This may be non-trivial. On the other hand, that very same distribution P{Gc \ D) that 
may cause difficulty also gives the major advantage of the fit-based approach; it allows any 
insights we have into how to fit a curve Gc to the data points D to be exploited. 



6.4 Exploiting FB Immediate Sampling 

To illustrate how fitting might improve immediate sampling, consider the case where 
■^oile) is qp KL distance. Say that G{x) is a high-dimensional convex paraboloid in- 
side a hypercube, and zero outside of that hypercube. Suppose as well that we have a 
single factual sampling distribution h, which is concentrated on one side of the paraboloid. 
For example, if the peak of the paraboloid is at the origin, h might be a Gaussian (masked 
by the hypercube) whose mean lies several sigmas away from the origin. 

To start, consider importance sampling MC estimation of the integral --^aiQe) for one 
particular 9, without any concern about choosing among 0's. Say that the factual sample 
D isn't too large. Then it is likely that no elements of D are in regions where G reaches 
its lowest values. For such a D, the associated factual estimate 

= ^'"^^""''^^ (27) 

is larger than the actual value, ^cile) (cf- Eq. [T4| . So straightforward importance- 
sampling integral estimation is likely to be badly ofFrj 

Intuitively, the problem is that as far as the factual estimate ^d{8) is concerned, G 
could just as well be a sum of delta functions centered at the x's in D, with low associated 
oracle sample values, as a paraboloid. If G were in fact such a sum, then would be 
correct. However by looking at the {x,G{x)) pairs in D, all of which lie on the same 
paraboloid, such an inference of G appears quite unreasonable. It makes sense to instead 
infer that G is a paraboloid. 

Fitting is a way to formalize (and exploit) such D-based insights. As an example, 
consider using a Bayesian PL algorithm to do the fitting. Typical choices for the prior 
P{Gc) used in PL would result in a posterior P{Gc \ D) that would be far more tightly 
concentrated about the actual G's paraboloid shape than about the sum of delta functions. 
Fitting would automatically reflect this, and thereby produce a better estimate of =^g(^) 
than #d(6I)|^ 



in Eq. 



25 



by dividing by X]i=a HfcMfc^Ii) • Such divisions would accord with tlie analogous division we 



do in our non-FB immediate sampling experiments. 

20. Since importance sampling is unbiased, tliis means its variance is likely to be large. 

21. It might be objected that in a different problem G actually would be the sum of delta functions, not the 
paraboloid. In that case the FB estimate is the one that would be in error. However this possibility is 
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Now we aren't directly concerned with the accuracy of our estimate ^cile) for E^ny 
single 6. We aren't even concerned with the overall accuracy of that estimate for a set of 
9's. Rather we are concerned with the accuracy of the ranking of the 6''s given by those 
estimates. For example, consider naive MCO, under which we choose the minimizing 
,^d{0). Even if all of our estimates (one for each 9) were far from the associated actual 
values, if their signed errors were identical, the naive MCO would perform perfectly. 

In other words, ultimately we are interested in correlations between errors of our esti- 
mates of ^oiQe) for different 0's. (See Sec. |2.3| ) Nonetheless, we might expect that if we 
tend to have large error in our estimates of ^aile) for the 0's, then everything else being 
equal, we would be likely to have large error in the associated estimate of an optimal 6. 



In Sec. 4.5 we exploited the equivalence between PL and MCO to improve upon naive 
MCO. However the parameter in MCO doesn't specify a functional fit to a data set. Ac- 
cordingly, the incorporation of PL into MCO considered in Sec. |4.5| doesn't involve fitting 
a function to D. This is why those PLMCO techniques don't address the issue raised in 
this example; fitting does that. So in full FB MCO, we may use PL in two separate parts 
of the algorithm, both to form the fit to D, and then to use those fits to choose among the 

e's. 



6.5 Statistical Analysis of FB MC 

Before analyzing expected performance of FB MCO, we start with the simpler case of FB 
MC introduced at the beginning of this section. For simplicity we assume that the integral 

can be calculated exactly for any D, so that no fictitious samples arise. 

As discussed in Sec. |2.3| two important properties of an MC estimator of an integral 
^(0) = J dw U{w, <j)) are the sample bias and the sample variance of that estimator. 
Together, these give the expected loss of the estimator under a quadratic loss function, 
conditioned on a fixed oracle U {.,(/)). 

This is just as true for a Bayesian fitting algorithm as for any other. For quadratic loss, 
for sample set D = {w*, U{'w\ 0)}, the Bayesian FB MC prediction for ^ is the posterior 
mean, 



■^D = J dw' dU,{.,cb) U,{w',cl>)P{U,{.,(f>) I D). (28) 
Accordingly, the expected quadratic loss of Bayesian FBMC is 
j dD P{D\v,U{.,m^ -^of = 

j dw^...dw^ n"(^*)[/ dw'U{w\4)) - j dw' dUci,(j))Uc{w',<j))P{Uc{.,(t)) \ D)]^ (29) 

i—l 

where v is the proposal distribution that is IID sampled N times to generate the sample 
set. 

In the usual way one can re-express this expected quadratic loss using a bias-variance 
decomposition. Whereas a conventional importance sample estimator of J dw U{w, </>) is 
unbiased, the Bayesian estimator is biased in general; typically 

J dD P{D\v,U{.,(t>)) ^- (30) 



exactly what the prior P{Gc) addresses; if in fact you have reason to believe that a G that is a sum of 
delta functions is a priori just as likely as a paraboloid G, then that should be reflected in P{Gc)- Doing 
so would in turn make the FB estimate more closely track the non-FB estimate. 
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This bias is a general characteristic of Bayesian estimators. Furthermore, for some functions 
C/(.,0), the Bayesian estimator will both be biased (unlike the factual sample estimator) 
and have higher variance than the factual sample estimator. So for those U {.,(/)), the 
Bayesian estimator has worse bias plus variance. 

In conventional importance sampling estimation of an integral, the sampling distribu- 
tion V is used twice. First it is used to form the sample set. Then, when the sample set 
has been formed, v is used again, to set the denominator values in the ratios giving the 



MC estimate of the integral (cf. Sec. 2.2 1. In contrast, Bayesian FB MC doesn't care what 
V is. P{Uc{-t4') I is independent of the values v{w'^). As mentioned at the beginning of 
this section, this is a typical feature of FB MC estimators. 

This feature does not mean that the sampling distribution is immaterial in FB MC 
however. Even though it does not arise in making the estimate, as Eq. |29 shows, v helps 
determine what the expected loss will be. Indeed, in principle at least, Eq. 29] can be used 
to guide the choice of the sampling distribution for Bayesian FB MC. It can even be used 
this way dynamically, at a midpoint of the sampling process, when one already has some 
samples of U{.,(j)). Such a procedure for using Eq. [29] to set v dynamically amounts to 



what is called 'active learning' in the PL literature (see Freund et al. 1997 Dasgupta and 
"Kalai, 2005). 



We now generalize the foregoing to the case of a non-quadratic loss function L. The 
Bayesian estimator produces the estimate 

= a.rgminp^^[J dUc{.,(t>) P{Uc{;<f>)\D)L[^D, ^uj] (31) 
Given that the factual oracle is U{., (f>), the expected loss with that Bayesian estimator is 

f dw' ...dw'' Y[v{w')L[^{^.^uin.^,M^ f dw' U{w',cb)]. (32) 

i=l 

The expected loss in Eq. [32] is an average over data with the oracle held fixed. This 
contrasts with the analogous quantity typically considered in Bayesian analysis, which is 
an average over oracles with the data held fixed. That quantity is the posterior expected 
loss, 

' dUi; 0)P(C/(., </.) I D)L[^D,^um (33) 

In general, different [/(.,0)'s will give different risks for the same estimator. So we 
can adapt any measure concerning loss in which U {.,(/)) varies, to concern risk instead. In 
particular, the posterior expected risk is 

dU{.,^)P{U{.,cl)) I D) {L[^D,^u{4>)] - inmpML[p,^um]}- (34) 

Often the lower bound on loss is always 0, so that minpgR[L[p, ^(7(0)]] = V U{.,(f>). In 
this case posterior expected risk just equals posterior expected loss. 

We can combine the non-Bayesian and Bayesian analyses, involving expected loss and 
posterior expected loss respectively. To do this we consider the prior-averaged expected 
loss, given by 

TV 

J dU{.,4>) P{U{.,4>)) J dwK..dw'' n«(i«Oi[-5^{(»',c/(«,s0))}, / dw' U{w',^)]. (35) 

i—1 
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where P{U{., cf))) is a prior distribution over oracles. 

Note that the prior-averaged expected loss is an average over both oracles and sample 
sets. It reflects the following experimental test of our FB MCO algorithm: Multiple times a 
factual oracle U{., (f>) is generated by sampling P{U{., (p)). For each such U (., 0), many times 
a factual sample set D is generated by sampling the likelihood P{D \ U{.,(j>),v). That D 
is then used by the FMCO algorithm to calculate J^'o. In performing that calculation, the 
algorithm assumes the same likelihood as was used to generate D, but its prior P{Uc{ .,(!))) 
may not be the same function of Uc{-, (j>) as P{U{., (f>)) is of [/(., (p). Then the loss between 

and is calculated. The quantity in Eq. [35] is the average of that loss. 

Say that P{U{.,(f))) is the same function of U{.,(j)) as P{Uc{-, (j))) is of Uc{-,4>)- Then 
the Bayesian estimator is based on the actual prior. In th is cas e, the Bayesian estimator 

will minimize the prior- averaged expected loss of Eq. 35^^ In general though, there 



is no reason to suppose that these two priors are the same. In the real world where those 
priors differ, expected loss for a Bayesian estimator is given by an inner product between the 
posterior used by tha t estimator, P{Uc{., </>) | D), and the true posterior, P{U{., (p) \ D) (see 
WolpertI [19971 |1996| )p^ 



As before, since ?/(.</)) varies in the integrand of prior-averaged expected loss, we can 
can adapt it to get a prior-averaged expected risk. This is given by 



N 



I dU{.,<l)) P{U {.,<!))) I dw^...dw^ f[v{w') X 

^(0)] - minpeB[L[p, (36) 

As before, if the minimal loss is always 0, then prior-averaged expected risk just equals 
prior-averaged expected loss. 

Broadly speaking, in Bayesian approaches to Monte Carlo problems, the sampling 
distribution that generated the samples is immaterial once one those samples have been 



generated (see Rasmussen and Gharamani 2003 1 and references therein) . So what differ 



ence does the choice of a sampling distribution like v make to a Bayesian? The answer is 
that V determines how likely it is that we will generate a D with a high posterior variance 
of the quantity of interest. For example, say one wishes to form an importance sampling 
estimate of = J d-x U{x) using sampling distribution v to generate sample set D. Then 
if one changes w, one changes the likelihoods of the possible D. Moreover, each D has its 
own posterior variance, Var(^c I ^)- So what a good choice of v means is that a D with 
poor Var(^c | D) is unlikely to be formed, that is, that / dD P{D \ w)Var(^c | D) is low. 

22. To see this, replace Jifn with some arbitrary function of D, f{D). Our task it to solve for the optimal 
/. First interchange the integrals over data and over oracles in Eq. |35[ Next consider the integrand of 
the outer (data) integral, 

f dUi.,<t>) PiUi.,4>))llv{w')L[f{D), f dw'U{w',m. 

Since we are considering a noise-free oracle, we can write this as 

d[/(.,0) P(U{.A))P{D I v,U{.A))L[f{D),J^um. 



Since P{U{.,<f>))P_{D \ v,U{.,(f>)) oc P{U{.,(t)) \ v,D) ^ P{U{.,<f>) \ D), this integral is minimized by 
setting f{D) = .^d- QED. 
23. It is in recognition of the fact that those functions might differ that we have been referring to 'Bayesian' 
rather than 'Bayes-optimal' estimators. 
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6.6 Statistical Analysis of FB MCO 

We can extend the statistical analysis of FB MC to the case of FB MCO by allowing (f) to 
vary. The Bayesian choice of </> is the one that minimizes posterior expected loss, 



^ argmin^[ j dU, P{U, \ D)L{cp, U,)]. (37) 



Since P{Uc | w,!?) = P{Uc \ D), this estimator is independent of v. The same is true for 
the posterior expected loss of this Bayesian estimator, 

dU P{U \ D)L{4>D,U). (38) 

On the other hand, the expected loss associated with this estimator, 

N 

/ dw\ . .dw^ l[v{w')Li4>{^,^u{^,^},U), (39) 
explicitly depends on v. So does the prior-averaged expected loss, 

I dUP{U) I dw\..dw''Jlv{w')L{^{^,^u(n.^)},U). (40) 

i=l 

Next, the posterior expected risk is 

dU P{U\D){L{^D:U) - nim^W.U)]} (41) 



where (j)' runs over the (implicit) set of all possible (f). In general min^/ [L(0', f/)] varies 
with U . (For example, this is the case with the loss function ^i[{(f)) of Eq. [4]) Accordingly, 
unlike in Bayesian FB MC, typically in Bayesian FB MCO the posterior expected risk does 
not equal the posterior expected loss. 

Finally, the prior-averaged expected risk is 

N 

I dUdw^.-dw"" P{U)l[v{w'){L{4,^^,,u(^.)},U) - mm^,[L{cf>',U)]}. (42) 

i—l 

Again, since min^/ [L((j)' , U)] typically varies with U, in general this prior-averaged expected 
risk does not equal the prior-averaged expected loss. However the estimator that minimizes 
prior-averaged expected loss — (j)D — is the same as the estimator that minimizes prior- 
averaged expected riskj^ 

For any particular fitting algorithm, our equations tell us how performance of the asso- 
ciated FB MCO depends on v and either PiU{., (/>)) or the pair P{U{.,(j))) and P(C/c(., (j))), 
depending on which equation we consider. So if we fix those prior(s), our equations tell us, 
formally, what the optimal v is. 

One can consider estimating that optimal w at a mid-way point of the algorithm, based 
on the algorithm's behavior up to that point. One can then set v to that estimate for the 
remainder of the algorithmP^ Doing this essentially amounts to a type of active learning. 
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24. This follows from the fact that the prior-averaged lowest possible risk, the term subtracted in Eq. 
independent of the choice of the estimator. 

25. Note though that if one intends to update v more than once, then strictly speaking the first update to 
V should take into account the fact that the future update will occur. That means the equations above 
for expected loss, prior-averaged expected loss, etc., no longer apply. 
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As with Bayesian FBMC, we can analyze the effects of having P{U) not be the same 
function of U as P{Uc) is of Uc- Since PL and MCO are formally the same, such an 
analysis applies to parametric machine learning in addition to FB MCO. In particular, 
the analysis gives a Bayesian correction to the bias-variance decomposition of supervised 
learning. This correction holds even if the fitting algorithm in the supervised learning 
cannot be cast as Bayes-optimal for some assumed prior P{Uc). Intuitively speaking, the 
correction means that the bias-variance decomposition gets replaced by a bias-variance- 
covariance decomposition. That covariance is between the posterior distribution over target 
functions on the one hand, and the posterior distribution over fits produced by the fitting 
algorithm on the other (see Wolpert 1997). 



6.7 Combining FB and Non-FB Estimates in FB MCO 



Return now to the example in Sec. 6.6 where the factual sample is formed by importance 



sampling the factual oracle and we form a fictitious sample set using fictitious oracles. Then 
using only D, our estimate of =^g(^) would be the factual estimate, ,:^d{&) — jy — 
Using only our fictitious samples would instead give us the estimate (9) . 

On the one extreme, say we firmly believe that distribution we use for the posterior 
P{Gc I D) is correct. (So in particular we firmly believe that the factual oracle G was 
generated by sampling the prior P{Gc)-) Then in the limit Nt oo, V0 our importance- 
sample estimate of will be exactly correct. So Bayesian decision theory would direct 

us to use the associated estimate ^oiP), and ignore ^£){9). At the other extreme, say 
that Nt = 1, while N, the number of factual samples, is quite large. In such a situation, 

even if we believe our posterior is correct, it would cleaxly be wrong to use ^ d{(^) as our 
estimate, ignoring .^d{9). 

How should we combine the estimates in this latter situation? More generally, even 
when we believe our posterior is correct, unless the number of fictitious samples is far greater 
than the number of factual samples, we should combine the two associated estimates. How 

best to do that? Does the fact that ^d{S) is estimated via importance sampling over a 
much larger space than (9) affect how we should combine them? More generally, say we 
don't presume that our P{Gc \ D) is exactly correct; how should we combine the estimates 
then? 

One is tempted to invoke Bayesian reasoning to determine how best to combine the 
two estimates. While that might be possible in certain situations, often determining the 
optimal Bayesian combination would necessitate yet more Monte Carlo sampling of some 
new integrals. It would be nice if some other approach could be used. 



One potential such approach is stacking (Wolpert 1992 Breiman 1996 Smyth and 
Wolpert"! 1999). In this approach, one many times partitions the factual sample D into two 



parts, a 'training set' D^, and a 'validation set' D^. We write the values of w and U in 
as {Dl^{i)} and {£)^(i, 0)} respectively, and similarly for D^. For each such partition 
one would run both the non-FB MCO algorithm and the FB MCO algorithm on D^. That 

generates the estimates (j)v,u,D^ and respectively. 

Those two <j)'s give us two associated error values on the validation set, J^j ^uU' <I>v.u.d^) 
and Duijy 4>d'^)^ respectively. More generally, we can evaluate the error on the the val- 
idation set of any (j), in addition to the errors of (t>v^u,D'^ and c/jjji . Moreover, we can do 
this for the validation set of any of the partitions of D. Note, however, that only factual 
samples are used for cross-validation. 
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This is what stacking exploits. In the most straightforward use of stacking, one searches 
for a function mapping the ^'s produced by our two algorithms to a composite (j). The goal 
is to find such a composite ^ that will have as small validation set error (when averaged 
over all partitions) as possible. 

For example, if is a Euclidean vector, one could perform a regularized search for the 
weighted sum of ^'s that gives minimal partition-averaged validation set error. Let the 
weights produced by that search be bps and bnon-FB- Then to find the final estimate for 
(f), one would use those weights to sum the outputs of the algorithms when run on all of D: 

bnon-FB(l>v,U,D + bpB<t>D- 

7. Conclusion 

In this paper we explored the relationship between Monte Carlo Optimization of a parametrized 
integral, parametric machine learning, and 'blackbox' or 'oracle'-based optimization. We 

made four contributions. 

First, we proved that MCO is identical to a broad class of parametric machine learning 
problems. This should open a new application domain for previously investigated para- 
metric machine learning techniques, to the problem of MCO. 

To test the use of PL in MCO one needs an MCO problem domain. The one we used 
was based on our second contribution, which was the introduction of immediate sampling. 
Immediate sampling is a way to transform an arbitrary blackbox optimization problem into 
an MCO problem. Accordingly, it provides us a way to test the use of PL to improve MCO, 

but testing whether it can improve blackbox optimization. 

In our third contribution we validated this way of improving blackbox optimization. In 
particular, we demonstratied that cross-validation and bagging improve immediate sam- 
pUng. 

Conventional Monte Carlo and MCO procedures ignore some features of the sample 
data. In particular, they ignore the relationship between the sample point locations and 
the associated values of the integrand; only the values of the integrand at those locations 
are considered. We ended by presenting fit-based MCO, which is a way to exploit the 

information in the sample locations. 

There are many PL techniques that should be applicable to immediate sampling but 
that are not experimentally tested in this paper. These include density estimation active 
learning, stacking, kernel-based methods, boosting, etc. Current and future work involves 
experimental tests of the ability of such techniques to improve MCO in general and imme- 
diate sampling in particular. 

Other future work is to conduct experimental investigations of the three techniques 
that we presented in this paper but did not test. One of these is fit-based MCO (and 
fit-based immediate sampling in particular). The other two are the techniques described in 
the appendices: immediate sampling for constrained optimization problems, and immediate 
sampling with elite objective functions. 

There are also many potential application domains for immediate sampling PC for 
blackbox optimization that we intend to explore. Some of these exploit the ability of such 
PC to handle arbitrary (mixed) data types of a;'s. In particular, one such data type is the 
full trajectory of a system through a space; for optimizing a problem over such a space, 
PC becomes a form of reinforcement learning. 
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A. Constrained Optimization 

Under the PC transform we replace an optimization problem over X with one over As 
discussed at the beginning of Sec. |3.3| the characteristics of the transformed objective can 
be very different from those of the original objective. 

Similarly, characteristics of any constraints on X in the original problem can also change 
significantly under this transformation. More precisely, say we add to (P4) equality and 
inequality constraints restricting x G X to a, feasible region. Then to satisfy those X- 
constraints we need to modify (P5) to ensure that the support of the solution q0{.) is a 
subset of the feasible region in X. 

This appendix considers some ways of modifying PC to do this. For earlier work on 



this topic in the context of delayed sampling, see 


Wolpert et al. 


(20061; 


Bieniawski and 


Wolpert 


(2004 


I; 


Bieniawski et al. 


(2004 


I; 


Macready and Wolpert 


( 


2005 


1. 



A.l Guaranteeing Constraints 

Say we have a set of equality and inequality constraints over X. Indicate the feasible region 
by a feasibility indicator function 



1, X is feasible, 
0, otherwise. 



For simplicity, we assume that for any x, we can evaluate <i>(x) essentially 'for free'. 
The transformed version of this constrained optimization problem is 

(P5c) : minimize ^^(ge), 

subject to qg{x)^{x) — 0. 

We now present a parametrization for q that ensures that it has zero support over infeasible 
X. First, let q be any parametrized distribution over X, for instance, a mixture of Gaussians. 
Then using <i>(a; G X) as a 'masking funtion' we parametrize qe^x) as 

^ qe{x)<^{x) 
qe{x) - 



/ dx' q0{x')<^>{x') 
= q^.eix). 

This qg automatically meets the constraints; it places zero probability mass at infeasible 
x's. It transforms the constrained problem (P5)c into the unconstrained problem 

(P5„c) : minimize 

Now consider the case where .^^^ is an integral over X. Typically in this case we are 
only concerned with the values of the associated integrand at feasible x's. For example, 
when Eqg{G) is of interest, it's usually because our ultimate goal is to find a feasible x 
with as good a G{x) as possible. In this situation it makes no sense to choose between 
two candidate qg's based on differences in (the G values at) the regions of infeasible x 
that they emphasize. More formally, our choice between them should be independent of 
their respective values of / [1 — <^[x)]qg{x)G{x). We can enforce this by replacing the 
objective Eqg{G) = J dx qg{x)G{x) with / dx ^{x)qg{x)G{x). If we then use the barrier 
function approach outlined above, our final objective becomes qp KL distance with the 
integral restricted to feasible x's. 

Generalizing this, when we are not interested in behavior at infeasible x we can reduce 
the optimization problem further from (PSuc), by restricting the integral to only run over 
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feasible x's. More precisely, write the original problem (P5c) as the minimization of 



dx[j dgP{g\x,<^)]F{g,q0{x)) = J dx ^j,{x, qe{x)), 

subject to the constraints on the support of qg. By using the q construction we can replace 
this constrained optimization problem with the unconstrained problem 

(Al): argmin^ J dx ^{x)^[x,qg{x)] — argmiug J dx ^{x)fi[x,q,j,^g{x)], 

■ /"^ ^/ ^ r <i>{x)q{x) . 
= argmm, J dx <P{xMx, -^^-^^^-^]. 

As an example, say our original objective function is pq KL distance. Define = 
/ dx p'^{x)^{x). Then our new optimization problem is to minimize over 9 

KL(p§||g*,e) = KL(^||g*) 

_ f ^^'Hx)pHx)_ qe{x)<^{x) , 



Zl 7 dx'qe{x')^{x'y' 
dx M^>!M {\n[qg{x)] + ln[<f>(x)] - ln[ / dx' qg{x')<^{x')]}. 



The qe minimizing this is the same as the one that maximizes 

dx^^^^^j^ln[ge(x)] - H f dx' qe{x')<^{x')]. (43) 

We can estimate Z^ using MC techniques. We can then apply MCO to estimate the 9 that 



maximizes the integral differencerj in Eq. 43 



To generate a sample of sample qg{x) — qe{x)<^{x) we can subsamplep^ qg according to 
In some cases though, this can be very inefficient (that is, one may get many rejections 
before getting a feasible x). To deal with such cases, we can first run a density estimator 
on the samples of feasible a;'s we have so far, getting a distribution tt. (Note that no extra 
calls to the feasibility oracle are needed to do this.) Next write qg{x) = ■n{x)[qg{x) / t:{x)\. 
This identity justifies the generation of samples of qg by first sampling 7r(x) and then 
subsampling according to qg{x) /i:{x) = qg{x)^{x) /'n:{x). 

In an obvious modification to the foregoing, we can replace the hard restriction that 
supp(g) contain only feasible x's, with a 'soft' constraint that q{x) < k V infeasible x. 
A similar alternative is to 'soften' <^{x) by replacing it with n for all infeasible x, for 
some K > 0. For either alternative we anneal k down to 0, as usual, perhaps using cross- 
validation. 



26. Note that in general this difference of integrals wil l not be convex in qe for product distributions, unlike 
KL(p^ II qe). See the discussion at the end of Sec. 3.1 on product distributions and pq distance. 



27. Say we want to sample a distribution A[x) oc B{x)C{x) where B is a distribution and C is non-negative 
definite, with c some upper bound on C. To generate such a sample by 'subsampling B according to C" 
we first generate a random sample of B{.), getting x' . We then toss a coin with bias C{x')/c. If that coin 
comes up heads, we keep x' as our sample of A. Otherwise we repeat the process (see Wolpert et al.| 
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A. 2 Alternative 

Since we're maximizing our expression over qg, the second, correcting integral in Eg. |43| will 
tend to push qg to have probabihty mass away from feasible regions. To understand this 
intuitively, say that qg is a Gaussian and that the feasible region is 'spiky', resembling a 
multi-dimensional star-fish with a large central region and long, thin legs. For this situation, 
if we over-concentrate on keeping most of qg's mass restricted to feasible x, our Gaussian 
will be pushed away from any of the spikes of the feasible x's, and concentrate on the 
center. If the solution to our original optimization problem is in one of those spikes, such 



over-concentration is a fatal flaw. The second integral in Eq. 43 corrects for this potential 
problem. 

More broadly, consider typical case behavior when one applies some particular con- 
strained optimization algorithm to any of the problems in a particular class of optimization 
problems. As a practical matter, there is a spectrum of such problem classes, indexed by 
how difficult it is just to find feasible solutions on typical problems of the class. On the 
one side of this spectrum are problem classes where it is exceedingly difficult to find such a 
solution, e.g., high-dimensional satisfiability problems with a performance measure G su- 
perimposed to compare potential solutions. On the other end are "simple" problem classes 
where it is reasonable to expect to find a feasible solution. The 'starfish' optimization 
problem is an example of a problem of the former typep^ 

For problems on the first side of the spectrum, where just getting a substantial amount 
of probability mass in to t he feasible region is very difficult, we may want to leave out the 



second integral in Eq. 43 In other words, we may want to minimize KL(p5^ || qg) rather 
than KL(p^ || q^.g). The reason to make this change is so that qg won't get pushed away 
from the feasible region. (As an aside, another potential benefit of this change is that if we 
make it, then for product distribution qg, ^cg{.) is convex.) 

Even if we do make this change, when we sample the resultant qg we may not get a 
feasible x. If this happens, a natural approach is to repeatedly sample qg until we do get a 
feasible x. However the resultant distribution of I's is the same as that formed by sampling 
q^^g for the same 9. So under this 'natural approach' we work to optimize a distribution 
{qg) different from the one we ultimately sample (g$.6i). This means that this approach 
may not properly balance our two conflicting needs for qg : that it have most of its support 
in the feasible region, and that it be peaked about x's with high p^{x). 

To illustrate this issue differently, take qg to be normalized, and to avoid multiplying 
and dividing by zero, modify ^{x) to equal some very small non-zero value k for infeasible x 
(as discussed above). Then under this 'compound procedure', we ultimately sample q$^g(.). 
However we do not choose ^c^iqc^.g) — KL(p^ || q^,g) as the function of 9 that we want to 
minimize. Instead we choose 

,) = KL(p^ II q^^g) ln[ J dx' ^1^]. (44) 



A. 3 Using Constraints for Unconstrained Optimization 

Return now to unconstrained optimization problems. Say that we have reason to expect 
that over a particular region R, the distribution p^{x) has values approximately n times as 
small as its value over X\R. It would be nice to reflect this insight in our parametrization 
of q, that is, to parametrize g in a way that makes it easy to match it to p^{x) accurately. 
We can do this using a binary- valued function <i> and the approaches presented above. 

28. Note that since we are discussing typical-case behavior, computational complexity considerations do not 
apply. 
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To illustrate this, define qg,^g as above and choose the objective function = 
KL{p^ II <7$,e), where <i>(a;) = k over R, and equals 1 over X \ Then working through 
the algebra, the qg that minimizes this objective is given by the qg that minimizes 



- / dxp^{x)ln[qg{x)] + ln[ / dx'qg{x')<i>{x')] = KL(/ || qg) + ln[ J dx' qg{x')<i>{x')] 

- KL(/ \\qg) + 

ln[ / dx'ji Kqg{x') + / dx'^^^J^ qe{x')]. 



(45) 



The logarithm on the right-hand side is a 'correction' to pq distance from p^^ to qg, a correc- 
tion that pushes qg away from regions where ^(x) — 1 (assuming k < 1). To use immediate 
sampling with this parametrization scheme, once we find the qg that minimizes the sum 
of pq distance plus that correction term, we would set h to the distribution (proportional 
to) qg{x)^{x). So we would generate our new samples from qg{x)^{x), for example by 
subsamplingpj 



B. The Elite Objective Function 



Not all PC objectives can be cast as an integral transform. Properly speaking, the choice 
of objective should be set by how the final qg will be used. For instance, the concept of 



expected improvement suggested by Mockus et al. ( 1978 1, and used by Jones et al. ( 1998 



considers an objective (to be maximized) given by max(G'bc — G{x),0), where Gbc is the 
best of all the current samples, mmi{G{x'^)}. This means that at each step we will take 
a single sample, and want to maximize the improvement. This is a simplification; even 
though the next sample may yield any improvement, it may be informative, so that we get 
a good sample ten steps later. A less simplistic objective is the following: 

In blackbox optimization, no matter how many calls to the (factual) oracle we make, 
we will ultimately choose the best x (as far as the associated G value is concerned) out of 
all the ones that were fed to the oracle during the course of the entire run. Our true goal 
in BO is to have the G associated with that best x be as small as possible. For a discussion 
of distributions of extremal values, 



see 



Leadbetter et al. (19831; Resnick (19871. 



Given that qg varies over the run in a way that we do not know beforehand, how can 
one approximate this goal as minimizing an objective function that is well-defined at all 
points during the run? One way to do this is to assume that there is some integer N such 
that, simultaneously, 

1. It is likely that the best x will be one of the final N calls to the oracle during the run; 

2. It is likely that qg will not vary much during the generation of those final N samples. 

Under (2) we can approximate the qg's that are used to generate the final N calls as all 
being equal to some canonical qg. Under (1), our goal then becomes finding the canonical 



29. Note that we use in this J?c#, not p^, which is what we used for constrained optimization. This is 
because our goal now is simply to find a qg that matches p'^{x). There are no additional aspects to the 
problem involving feasibility regions that have no a priori relation to G{x). 

30. In practice, ^{x) for this unconstrained case would not be provided by an oracle. Instead we would 
typically have to estimate it. We could do that for example by using a regression to form a fit to samples 
of p^ (x) and then use that regression to define the region R. 
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qg that, when sampled N times, produces a set of x's whose best element is as good as 
possible. 

In this appendix we make some cursory comments about this objective function, which 
we call the elite objective function. We focus on the use of Bayesian FB techniques 
with this objective. For a noise-free oracle the CDF for the elite objective is 

N 

CDF(fc) ^ I- dx^ ...dx^ W[qg{x')Q{G{x') - k)]. (46) 
•' i=i 

So the associated density function is 

dCDF(fc) 



dk 

= Nqe{k)[l dxqe{x)Q{G{x)-k)\^''-^\ (47) 



The associated expectation value, / dk kf{k), is not linear in Re- 
writing it out, the posterior expected best-of-X value returned by the oracle when 
queries are generated by sampling qg is 

„ K K 

dx\.. dx"" H qg{x') / dG P{G \D) dg\.. dg"" J] P{g'' \ x^G)mmk{g''} (48) 
'' 1=1"' ■' j=i 

We want the 6 minimizing this. Say we knew the exact posterior P(G \ D) and could 
evaluate the associated integral in Eq. |48] closed- form. In this case there would be need 
for the parametric machine learning techniques used in the text. In particular there would 
be no need for regularization — an analogous role is played by the prior P{G) underlying 
P{G\D). 

When we cannot evaluate the integral in closed form we must approximate it. To 
illustrate this, as in Sec. |6] for simplicity consider a single-valued oracle G. This reduces 
Eq.|48]to 

j dx^ ... dx^ Jl qg{x') j dG P{G I D)mmk{G{x'')}. (49) 



(The analogous FB MCO equation for objective functions involving a single integral Eq. 20 
here the single 'x' in that equation is replaced with a set of K cc's sampled from qg.) To 
approximate this integral we draw Nt sample-vectors of K x's each, using a sampling 
distribution hc{x) to do so. At the same time we draw Nt fictitious oracles from some 
sampling distribution H over oracles. 



31. An obvious variant of this reasoning is to have A'' vary across the run of the entire algorithm, at any 
iteration t being only the number of remaining calls to the oracle that we presume will be made. In 
this variant, one would modify the elite objective function to only involve the N{t) remaining samples 
whose G value is better than the best found by iteration t. For the case N — 1, this is analogous to the 
expected improvement idea in Jones et al. (19981. Note that this variant objective function will change 



during the run, which may cause stability problems. 
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To simplify notation, let x indicate such a ii'-tuple of x's. (So for multidimensional X, 
X is actually a matrix.) Also write 



K 



i=l 
K 



i=l 

G{x) ^ (G(xi ),..., G(a;^)). (50) 
With this notation, the estimate based on fictitious samples introduced in Sec . [6| become^^ 

^ n ^^^-m..,...HG.(.. )}. (51) 

As discussed in Sec. [6] it is often good to set H{G) to be as close to P{G \ D) as 
possible. So for example if we assume a Gaussian process model, typically we can set 
H{Gi) — P{Gi I D), and then directly sample H to get the values of one Gi at the K 
separate points x^. Alternatively, we can first form a fit 4>{x) to the data in D. Next, 
for each of N samples Xi, sample a colored (correlated) noise process over the K points 
{xi\ to get K real numbers. Finally, add those K numbers to the corresponding values 
{(t){xl) : j = 1, ■ . • , K}. This gives our desired sample of {Gi{xi)}. 

To illustrate the foregoing, suppose K = I, and that we have no regularization on qg. 
Then, in general, the sum in Eq. [51] is minimized by a qg that is a delta function about 
that data point xj with the best associated value Gi{xj) /hc{x}). However for K > 1, even 
without regularization, the optimal qg is not a delta function, in generalj^ In addition 
to the regularization-based argument in the text, this gives a more formal reason why the 
optimal qg should not be infinitely peaked. 

When A' > 1, the peakedness of qg parallels the peakedness of another non-negative 
function over x's, namely P{G : G(x) is minimized at x \ D). However, if we run a few iter- 
ations of FB MCO with the elite objective, then D grows, and so P{G : G{x) is minimized at x \ 
D) gets increasingly peaked over x's. (Intuitively, the larger D is, the more confident we are 
about G, and consequently the more confident we are about what regions of a;'s minimize 
G.) Accordingly, qg gets increasingly peaked as the algorithm progresses. 

Note that this happens even though there is no external annealing schedule. This 
reflects the fact that the elite objective has no hyperparameter or regularization parameter 
like the (3 that appears in both the pq and qp objective functions. 



C. Gaussian Example for Risk Analysis 

The following example illustrates the foregoing for the case of Gaussian tt, where only 
moments of tt up to order 2 matter. 

To illustrate the foregoing, consider the simple case where there are only two 0's, (t>^ 
and 0^. Suppose that U and X are such that tt is a two-dimensional Gaussian. Write tt's 
mean as Say that one of tt's principal axes is parallel to the diagonal line, li = 1-2 (that 



32. In practice there might be more efficient sampfing procedures than Eq. 51 For example, one could form 
NK samples of hdx) and A'^ samples of H{G), to get two sets, which one then subsamples many times, 
to get pairs [x, G{x)]. 

33. This suboptimality of a delta function qe is similar to the suboptimality of having all K pulls in a 
multi-armed bandit problem be pulls of the same arm. 
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is, one of the eigenvectors of tt's covariance matrix is parallel to the diagonal, and one is 
orthogonal to the diagonal) . Write the standard deviation of tt along that diagonal axis as 
a A, and write the standard deviation along the other, orthogonal axis as ctb- 

Since tt's covariance matrix has identical diagonal entries, and since the trace of that 
matrix is preserved under rotations, those entries are both ^[cr^ + cr^]. Since the determi- 
nant is preserved, and since aA is the variance parallel to the diagonal, this in turn means 
that tt's (identical) off-diagonal entries are ^[(T^ — cr^]. The probability that MCO will 
choose (j)^ is the integral of tt over the half- plane where cj)^ < 4>^'- 



Pr(^(02) > i^(0i)) = erf(^^^). (52) 



Next, define 



AL EE ^((/.l) -^(02), 

A6 EE K-i^(0l)] - [M2-i^(<^')] 

= - ^i2] - AL. (53) 

So the difference in the value of the loss function between the two 0's is AL, and the the 
difference in the biases of the two estimators ^{(f)^) and ^{(j)^) is Ab. Note also that the 
variances of the two estimators are the same, 

Var[^(0i)] = Var[^(02)] ^ <±^^ (54) 

So if we shrink the variance of either of the estimators, then we shrink an upper bound on 
as- 

For this case of a fixed set of (j)'s, it is illuminating to consider the difference between 
expected loss under a particular MCO algorithm and minimal expected loss over all (/)'s, 
that is, the risk of the MCO algorithm. Assuming AL < 0, it is given by 

[Pr[i^(</.2) > ^(<^i)] - e[^(^2)_^(^i)]] X [^(^1) _ ^(^2)] 



[erf(^^^-^)-e(A6 + /i2-/ii)] x [M1-M2-A6]. (55) 



Say that A6 = 0. Then Eq. 55 shows that so long as /ii 7^ /i2, as ctb — > risk goes to 
its minimal possible value of zero. So everything else being equal, shrinking the variance 
of either estimator reduces risk, essentially minimizing it. Alternatively, if we leave the 
variances of the two estimators unchanged, but increase their covariance, |[cr^ — cr^], then 
(7a will increase, while ag must shrink. So again, the risk will get reduced. For the more 
general, non-Gaussian case, the high order moments may also come into play. 
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